## Set Up Neo4j Environment

Start by creating a Neo4jGraph instance.

In [2]:
import os
from dotenv import load_dotenv
from langchain_community.graphs import Neo4jGraph


# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
graph = Neo4jGraph(refresh_schema=False)
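Neo4jGraph is constructed above without explicit credentials, so it relies on the NEO4J_* variables being present in the environment. A minimal sketch of a pre-flight check follows; the helper name `missing_env_vars` is an assumption for illustration and is not part of LangChain.

```python
import os
from typing import Iterable, List, Mapping

# Variables the cells in this notebook read from the environment
REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in `env` (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this before creating the graph turns a confusing connection error into an explicit list of what is missing.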

## Chunk Text

The Microsoft paper recommends smaller chunk sizes (~600 tokens) to extract more entities overall. The steps for this section are:
- Import PDFs for this example (let's use 2 papers)
- Convert the PDFs to Markdown or text files (use the matextract workflow as a basis)
- Chunk the text from the articles (the best way to do this still needs exploring)

Convert PDF to Markdown using marker:

In [3]:
#! marker_single data/pdf/10.1038s41586-019-1798-7.pdf data/md --langs English
! marker data/pdf/10-DOIs-for-KG data/md --workers 10 

/bin/bash: line 1: marker_single: command not found


Clean up the Markdown file by removing unneeded sections. For this example, I'll take a simpler approach and use just one .md file. We'll take its text and save it as a single chunk.

To scale this workflow, we'd use multiple chunks (from multiple papers) instead of just one. We could store many chunks in a vector database and retrieve the most relevant ones.

ChemNLP offers a more thorough workflow for cleaning up .md files generated from scholarly articles.

In [3]:
import re

def clean_text(text):
    # Delete the pattern [MISSING_PAGE_FAIL:x]
    cleaned_text = re.sub(r"\[MISSING_PAGE_FAIL:\d+\]", "", text)

    # Delete the acknowledgements section
    cleaned_text = re.sub(
        r"## Acknowledgements.*?(?=##|$)", "", cleaned_text, flags=re.S
    )

    # delete the references section 
    cleaned_text = re.sub(r"## *Notes And References.*", "", cleaned_text, flags=re.S)

    return cleaned_text



input_file = "data/md/10.1039c0dt00999g/10.1039c0dt00999g.md"

with open(input_file, "r", encoding="utf-8") as f:
    content = f.read()

# clean the text
text = clean_text(content)
print(text)

This article is published as part of the *Dalton Transactions* **themed issue entitled:** 

# New Talent Asia

Highlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*

![0_image_0.png](0_image_0.png)

Image reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-crystal-to-single-crystal transformation Zheng-Ming Hao and Xian-Ming Zhang, *Dalton Trans*., 2011, DOI: 10.1039/C0DT00979B, ARTICLES: 
Negative thermal expansion emerging upon structural phase transition in ZrV2O7 and HfV2O7 Yasuhisa Yamamura, Aruto Horikoshi, Syuma Yasuzuka, Hideki Saitoh and Kazuya Saito Dalton Trans., 2011, DOI: 10
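To make the effect of the three substitutions easier to see, here is a small sketch that applies the same regular expressions to a synthetic Markdown string. The sample text is invented for illustration and is not taken from the paper.

```python
import re


def clean_text(text: str) -> str:
    # Same three substitutions as in the cell above
    text = re.sub(r"\[MISSING_PAGE_FAIL:\d+\]", "", text)
    text = re.sub(r"## Acknowledgements.*?(?=##|$)", "", text, flags=re.S)
    text = re.sub(r"## *Notes And References.*", "", text, flags=re.S)
    return text


# Synthetic input: one page-failure marker, one section to drop, one to keep
sample = (
    "# Title\n\nBody text.[MISSING_PAGE_FAIL:3]\n\n"
    "## Acknowledgements\nThanks to funders.\n\n"
    "## Results\nKept.\n\n"
    "## Notes And References\n1. Some citation.\n"
)

cleaned = clean_text(sample)
print(cleaned)
```

The acknowledgements pattern stops at the next `##`, so later sections survive, while the references pattern deletes everything to the end of the file. Both patterns are case-sensitive, so headings spelled differently would not be removed.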

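Document-specific text chunking strategies will be used, i.e. LangChain's "MarkdownTextSplitter".

It is similar to recursive text splitting, but a step up: it tries to split along Markdown structure such as headings before falling back to generic separators.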

In [4]:
from langchain.text_splitter import MarkdownTextSplitter
# chunk_size is measured in characters by default
splitter = MarkdownTextSplitter(chunk_size=2000, chunk_overlap=0)

In [5]:
chunks = splitter.create_documents([text])

In [6]:
chunks[0]

Document(page_content='This article is published as part of the *Dalton Transactions* **themed issue entitled:** \n\n# New Talent Asia\n\nHighlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*\n\n![0_image_0.png](0_image_0.png)\n\nImage reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-crystal-to-single-crystal transformation Zheng-Ming Hao and Xian-Ming Zhang, *Dalton Trans*., 2011, DOI: 10.1039/C0DT00979B, ARTICLES: \nNegative thermal expansion emerging upon structural phase transition in ZrV2O7 and HfV2O7 Yasuhisa Yamamura, Aruto Horikoshi, Syuma Yasuzuka, Hideki Saitoh and Kazuya Sa
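MarkdownTextSplitter builds on recursive splitting with Markdown-aware separators. The following is a simplified sketch of that idea, not LangChain's actual implementation: the separator list and the function `recursive_split` are assumptions made for illustration.

```python
from typing import List, Sequence

# Assumed separator order, coarsest first: headings, paragraphs, lines, words
SEPARATORS = ("\n## ", "\n\n", "\n", " ")


def recursive_split(text: str, chunk_size: int, separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split `text` into pieces of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for i, part in enumerate(text.split(sep)):
        # Keep the separator attached to the piece it introduces
        piece = part if i == 0 else sep + part
        if len(current) + len(piece) <= chunk_size:
            current += piece
        else:
            # Flush what has accumulated; oversized pieces are split with finer separators
            chunks.extend(recursive_split(current, chunk_size, finer))
            current = piece
    chunks.extend(recursive_split(current, chunk_size, finer))
    return chunks


doc = "# A\n\nintro para\n\n## B\n\nsection b text\n\n## C\n\nsection c text"
pieces = recursive_split(doc, chunk_size=30)
for piece in pieces:
    print(repr(piece))
```

With this toy input each `##` section ends up in its own chunk, which is the behaviour that makes heading-aware splitting attractive for papers.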

### Extracting Nodes and Relationships

Two approaches are possible:
- Loop through each chunk with the LLM to extract the KG info (nodes, relationships, properties)
- Employ a vector database and do a similarity search based on a query, then only look through the matching chunks

I will begin with the first approach. If it's not too time-consuming or expensive, I think it would be best.

LLMGraphTransformer from LangChain will be used to extract entities from unstructured text. Providing a predefined schema is optional. However, you cannot provide a description or additional context for your schema.

I think it would be useful to provide some MOF-related context so that the LLM knows what it is looking for. Modifying the system prompt in the LangChain code base could help: we could pass an optional "schema description" variable that carries the additional context. We just have to be careful not to exceed the context window of our LLM. If possible, we could also pass an optional description for each part of the schema, i.e. further describe a node type or relationship so the LLM better understands what to look for.
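One way the per-type descriptions could be assembled into a single context string is sketched below. The function `build_schema_context` and the `max_chars` budget are hypothetical; the character limit is only a crude stand-in for a real token count.

```python
from typing import Dict


def build_schema_context(
    node_descriptions: Dict[str, str],
    relationship_descriptions: Dict[str, str],
    max_chars: int = 4000,  # assumed budget, not a real model limit
) -> str:
    """Render optional schema descriptions as a Markdown-style prompt section."""
    lines = ["### Node Types:"]
    lines += [f"- {name}: {desc}" for name, desc in node_descriptions.items()]
    lines += ["", "### Relationship Types:"]
    lines += [f"- {name}: {desc}" for name, desc in relationship_descriptions.items()]
    context = "\n".join(lines)
    if len(context) > max_chars:
        raise ValueError(f"Schema context is {len(context)} characters, budget is {max_chars}")
    return context


example_context = build_schema_context(
    {"MOF": "A metal-organic framework.", "Linker": "An organic molecule connecting metal centres."},
    {"Has_Linker": "A MOF contains a specific linker."},
)
print(example_context)
```

Keeping the descriptions in dictionaries means the same keys can also feed `allowed_nodes` and `allowed_relationships`, so the schema and its documentation cannot drift apart.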

In [7]:
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
llm_transformer = LLMGraphTransformer(
  llm=llm, 
  allowed_nodes=["MOF", "Bond", "Atom", "Metal", "Linker"],
  allowed_relationships=["Has_Bond", "Has_Atom", "Has_Linker"],
)

In [8]:
from typing import List
from langchain_community.graphs.graph_document import GraphDocument
from langchain_core.documents import Document
def process_document(doc: Document) -> List[GraphDocument]:
    return llm_transformer.convert_to_graph_documents([doc])

In [38]:
# Process each chunk using the process_document function
graph_documents = []
for chunk in chunks:
    graph_document = process_document(chunk)
    graph_documents.extend(graph_document)

# Print or handle the processed graph_documents as needed
print(f"Processed {len(graph_documents)} GraphDocuments.")

Processed 20 GraphDocuments.


In [39]:
graph_documents

[GraphDocument(nodes=[Node(id='Dalton Transactions', type='Mof')], relationships=[], source=Document(page_content='This article is published as part of the *Dalton Transactions* **themed issue entitled:** \n\n# New Talent Asia\n\nHighlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*\n\n![0_image_0.png](0_image_0.png)\n\nImage reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligands Yuan Han and Han Vinh Huynh, *Dalton Trans*., 2011, DOI: 10.1039/C0DT01037E Solvent induced molecular magnetic changes observed in single-crystal-to-single-crystal transformation Zheng-Ming Hao and Xian-Ming Zhang, *Dalton Trans*., 2011, DOI: 10.1039/C0DT00979B, ARTICLES: \nNegative thermal expansion emerging upon structural phase transition in ZrV2O

### Changing LangChain's source code to allow for an optional "additional context" prompt

In [9]:
context = ''' 
### **Node Types:**
- **MOF (Metal-Organic Framework):** Refers to compounds consisting of metal ions or clusters coordinated to organic ligands.
- **Bond:** Represents a connection between two atoms within a MOF.
- **Atom:** The basic unit of a chemical element in a MOF.
- **Metal:** A chemical element forming positive ions and involved in the MOF's structure.
- **Linker:** An organic molecule connecting metal ions or clusters in a MOF.

### **Relationship Types:**
- **Has_Bond:** Links an "Atom" to another "Atom" via a bond within a MOF.
- **Has_Atom:** Indicates that a "MOF" contains a specific "Atom".
- **Has_Linker:** Indicates that a "MOF" contains a specific "Linker".

### **Important Guidelines:**
1. **Scientific Context:** Interpret terms within the context of chemistry. For example, "Atom" should refer to elements within a MOF, not other uses of the word.
2. **Disambiguation:** If a term could be ambiguous, prefer the scientific interpretation. Ignore non-scientific entities or terms unless directly relevant to MOFs.
3. **Entity Consistency:** Ensure consistent naming for entities. For example, always use the full name of a MOF or a chemical element even if it appears in a shortened form in the text.
4. **Domain-Specific Instructions:** Use technical jargon or abbreviations only within the context of chemistry, and classify them correctly.
5. **Filtering Non-Relevant Content:** Ignore or deprioritize non-scientific text or sections irrelevant to the chemistry-specific nodes and relationships.
6. **Known Misinterpretations:** Do not classify journal names, like "Dalton Transactions," as a scientific entity like "MOF."

### **Examples of Correct Classifications:**
- "The MOF Zn-BTC has a bond between Zinc (Metal) and Benzene Tricarboxylate (Linker)."
  - `{"head": "Zn-BTC", "head_type": "MOF", "relation": "Has_Bond", "tail": "Zinc", "tail_type": "Metal"}`
  - `{"head": "Zn-BTC", "head_type": "MOF", "relation": "Has_Linker", "tail": "Benzene Tricarboxylate", "tail_type": "Linker"}`

'''

In [10]:
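# custom_llm is a local, modified copy of LangChain's graph transformer that accepts a `context` argument
from custom_llm import LLMGraphTransformer as CustomLLMGraphTransformer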
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")
llm_transformer = CustomLLMGraphTransformer(
  llm=llm, 
  context=context,
  allowed_nodes=["MOF", "Bond", "Atom", "Metal", "Linker"],
  allowed_relationships=["Has_Bond", "Has_Atom", "Has_Linker"],
  node_properties=["description"],
  relationship_properties=["description"]
)

In [11]:
from typing import List
from langchain_community.graphs.graph_document import GraphDocument
from langchain_core.documents import Document
def process_document(doc: Document) -> List[GraphDocument]:
    return llm_transformer.convert_to_graph_documents([doc])

In [12]:
# Process each chunk using the process_document function
custom_graph_documents = []
for chunk in chunks:
    custom_graph_document = process_document(chunk)
    custom_graph_documents.extend(custom_graph_document)

# Print or handle the processed graph_documents as needed
print(f"Processed {len(custom_graph_documents)} GraphDocuments.")

Processed 20 GraphDocuments.


In [13]:
custom_graph_documents

[GraphDocument(nodes=[Node(id='Dalton Transactions', type='Mof', properties={'description': 'A journal that publishes research in inorganic and organometallic chemistry.'})], relationships=[Relationship(source=Node(id='Dalton Transactions', type='Mof'), target=Node(id='New Talent Asia', type='Mof'), type='HAS_LINKER', properties={'description': 'Themed issue highlighting younger members of the inorganic academic community in Asia.'})], source=Document(page_content='This article is published as part of the *Dalton Transactions* **themed issue entitled:** \n\n# New Talent Asia\n\nHighlighting the excellent work being carried out by younger members of the inorganic academic community in Asia Guest Editor Masahiro Yamashita Tohoku University, Japan Published in issue 10, 2011 of *Dalton Transactions*\n\n![0_image_0.png](0_image_0.png)\n\nImage reproduced with permission of Kenneth Kam-Wing Lo Articles in the issue include: PERSPECTIVES: Pyrazolin-4-ylidenes: a new class of intriguing ligan

In [14]:
graph.add_graph_documents(
    custom_graph_documents,
    baseEntityLabel=True,
    include_source=True
)

Let's see how many nodes and relationships were extracted:

In [15]:
graph.query("""
MATCH (n:`__Entity__`)
RETURN "node" AS type,
       count(*) AS total_count,
       count(n.description) AS non_null_descriptions
UNION ALL
MATCH (n)-[r:!MENTIONS]->()
RETURN "relationship" AS type,
       count(*) AS total_count,
       count(r.description) AS non_null_descriptions
""")

[{'type': 'node', 'total_count': 118, 'non_null_descriptions': 21},
 {'type': 'relationship', 'total_count': 127, 'non_null_descriptions': 13}]
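The counts above are easier to interpret as ratios. A short sketch, using the query result copied from the output above, computes the share of nodes and relationships that received a description; the helper `description_coverage` is a name chosen for illustration.

```python
# Query result copied from the output above
rows = [
    {"type": "node", "total_count": 118, "non_null_descriptions": 21},
    {"type": "relationship", "total_count": 127, "non_null_descriptions": 13},
]


def description_coverage(row: dict) -> float:
    """Fraction of items of this type that have a non-null description."""
    return row["non_null_descriptions"] / row["total_count"]


for row in rows:
    print(f"{row['type']}: {description_coverage(row):.1%} have a description")
```

This works out to roughly 17.8% of nodes and 10.2% of relationships, so most extracted items carry no description for later steps to use.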

## Entity Resolution

We will now remove duplicate entities from the knowledge graph.
More research should be done on entity resolution techniques. Could named entity linking issues also be resolved in this step?

The article uses a four-step approach for entity resolution:
1) Entities in the graph: start with all entities in the graph
2) K-nearest graph: construct a k-nearest neighbor graph, connecting similar entities based on text embeddings
3) Weakly connected components: identify weakly connected components in the k-nearest graph, grouping entities that are likely to be similar. Add a word distance filtering step after these components have been identified
4) LLM evaluation: use an LLM to evaluate these components and decide whether the entities within each component should be merged, resulting in a final decision on entity resolution (for example, merging ‘Silicon Valley Bank’ and ‘Silicon_Valley_Bank’ while rejecting the merge for different dates like ‘September 16, 2023’ and ‘September 2, 2023’)

Could an approach like this be used to resolve entity linking issues for MOFs? Probably not.

Begin by calculating text embeddings for the names and descriptions of entities:

In [16]:
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings

vector = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    node_label='__Entity__',
    text_node_properties=['id', 'description'],
    embedding_node_property='embedding'
)

Use the cosine distance between embeddings to find potentially similar candidates. We will use the graph algorithms available in the Graph Data Science (GDS) library.

First, we have to project an in-memory graph on which to execute the graph algorithms.

In [17]:
from graphdatascience import GraphDataScience
# connect to the Graph Data Science library

gds = GraphDataScience(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])
)

The graph stored in Neo4j is projected into an in-memory graph for faster processing and analysis. Graph algorithms are executed on the in-memory graph. The results can optionally be stored back into the Neo4j database.

In [20]:
G, result = gds.graph.project(
    "entities2h0645",                   #  Graph name
    "__Entity__",                 #  Node projection
    "*",                          #  Relationship projection
    nodeProperties=["embedding"]  #  Configuration parameters
)

Create a k-nearest graph, which adds relationships between entities only if their similarity is above a cutoff (0.7 in the cell below, rather than a stricter value such as 0.95). Use the "mutate" mode of the algorithm to store the results in the projected in-memory graph instead of the knowledge graph.

In [21]:
# Strategy: use a very low threshold and rely more on the LLM to filter out entities that should not be merged
similarity_threshold = 0.7

gds.knn.mutate(
  G,
  nodeProperties=['embedding'],
  mutateRelationshipType= 'SIMILAR',
  mutateProperty= 'score',
  similarityCutoff=similarity_threshold
)

ranIterations                                                             5
nodePairsConsidered                                                   36049
didConverge                                                            True
preProcessingMillis                                                       0
computeMillis                                                           298
mutateMillis                                                             55
postProcessingMillis                                                      0
nodesCompared                                                           118
relationshipsWritten                                                   1180
similarityDistribution    {'min': 0.9015045166015625, 'p5': 0.9166793823...
configuration             {'mutateProperty': 'score', 'jobId': '2fa06ac2...
Name: 0, dtype: object
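To illustrate what the k-nearest step computes, here is a brute-force sketch in plain Python. It is not how GDS works internally (GDS uses an approximate algorithm), and the two-dimensional vectors are made up; real embeddings have many more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_pairs(
    embeddings: Dict[str, Sequence[float]], top_k: int, cutoff: float
) -> List[Tuple[str, str, float]]:
    """For every entity, keep its top_k most similar neighbours that reach the cutoff."""
    pairs = []
    for name, vec in embeddings.items():
        scored = [
            (other, cosine_similarity(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != name
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        pairs += [(name, other, score) for other, score in scored[:top_k] if score >= cutoff]
    return pairs


# Made-up toy vectors, for illustration only
toy = {"Urotropin": [1.0, 0.0], "Urotropine": [0.9, 0.1], "Zinc": [0.0, 1.0]}
similar = knn_pairs(toy, top_k=1, cutoff=0.7)
print(similar)
```

In the toy data the cutoff leaves "Zinc" without a neighbour. In the real run above the minimum similarity is about 0.90, so a 0.7 cutoff filters out nothing.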

In [22]:
gds.wcc.write(
    G,
    writeProperty="wcc", #using Weakly connected components algorithm
    relationshipTypes=["SIMILAR"]
)

writeMillis                                                             14
nodePropertiesWritten                                                  118
componentCount                                                           1
componentDistribution    {'min': 118, 'p5': 118, 'max': 118, 'p999': 11...
postProcessingMillis                                                     2
preProcessingMillis                                                      0
computeMillis                                                            7
configuration            {'writeProperty': 'wcc', 'jobId': 'dd2abe90-1e...
Name: 0, dtype: object
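The weakly connected components step groups every node that can be reached through SIMILAR relationships, ignoring direction. A minimal union-find sketch of the same idea follows; it is independent of GDS, and the node names are placeholders.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def connected_components(
    nodes: Sequence[Hashable], edges: Sequence[Tuple[Hashable, Hashable]]
) -> List[List[Hashable]]:
    """Group nodes into components, treating every edge as undirected."""
    parent = {node: node for node in nodes}

    def find(x):
        # Follow parent pointers to the root, compressing the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for node in nodes:
        groups[find(node)].append(node)
    return list(groups.values())


components = connected_components(["A", "B", "C", "D"], [("A", "B"), ("C", "B")])
print(components)
```

A `componentCount` of 1, as in the output above, means every entity is linked to every other entity through some chain of SIMILAR relationships, so this step alone does not separate anything.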

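Text embedding comparison is a starting point, but only part of entity resolution. For example, Google and Apple are very close in the embedding space (0.96 cosine similarity). The output above shows the problem even more clearly: the lowest similarity score is about 0.90, so the 0.7 cutoff removes nothing and WCC places all 118 entities in a single component.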

Therefore, we add an additional filter that only keeps pairs of names within a small text (edit) distance. The query below uses `< $distance` with a value of 3, so at most two characters may be inserted, deleted, or changed.

This step prevents the named entity linking problem from being resolved with this method, since for that use case we might want "MOF-801" and "Compound 1" to be resolved to the same entity.
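The text distance used in the query, `apoc.text.distance`, is the Levenshtein edit distance. A plain-Python sketch of that metric shows why near-identical names pass the filter while "MOF-801" and "Compound 1" never could:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


for left, right in [("Urotropin", "Urotropine"), ("Compound 1", "Compound 2"), ("MOF-801", "Compound 1")]:
    print(left, "|", right, "->", levenshtein(left.lower(), right.lower()))
```

The names are lowercased first, mirroring the `toLower` calls in the Cypher query. Because "MOF-801" and "Compound 1" already differ in length by three characters, their distance can never fall below the threshold.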

In [24]:
word_edit_distance = 3
potential_duplicate_candidates = graph.query(
    """MATCH (e:`__Entity__`)
    WHERE size(e.id) > 4 // longer than 4 characters
    WITH e.wcc AS community, collect(e) AS nodes, count(*) AS count
    WHERE count > 1
    UNWIND nodes AS node
    // Add text distance
    WITH distinct
      [n IN nodes WHERE apoc.text.distance(toLower(node.id), toLower(n.id)) < $distance | n.id] AS intermediate_results
    WHERE size(intermediate_results) > 1
    WITH collect(intermediate_results) AS results
    // combine groups together if they share elements
    UNWIND range(0, size(results)-1, 1) as index
    WITH results, index, results[index] as result
    WITH apoc.coll.sort(reduce(acc = result, index2 IN range(0, size(results)-1, 1) |
            CASE WHEN index <> index2 AND
                size(apoc.coll.intersection(acc, results[index2])) > 0
                THEN apoc.coll.union(acc, results[index2])
                ELSE acc
            END
    )) as combinedResult
    WITH distinct(combinedResult) as combinedResult
    // extra filtering
    WITH collect(combinedResult) as allCombinedResults
    UNWIND range(0, size(allCombinedResults)-1, 1) as combinedResultIndex
    WITH allCombinedResults[combinedResultIndex] as combinedResult, combinedResultIndex, allCombinedResults
    WHERE NOT any(x IN range(0,size(allCombinedResults)-1,1)
        WHERE x <> combinedResultIndex
        AND apoc.coll.containsAll(allCombinedResults[x], combinedResult)
    )
    RETURN combinedResult
    """, params={'distance': word_edit_distance})
potential_duplicate_candidates

[{'combinedResult': ['Urotropin', 'Urotropine']},
 {'combinedResult': ['Zn(3)', 'Zn(4)']},
 {'combinedResult': ['Zn4(Dmf)(Ur)2(Ndc)4', '[Zn4(Dmf)(Ur)2(Ndc)4]']},
 {'combinedResult': ['Compound 1', 'Compound 2']},
 {'combinedResult': ['Framework', 'Framework 2']},
 {'combinedResult': ['Guest Molecule', 'Guest Molecules']},
 {'combinedResult': ['N Atom', 'O Atoms']},
 {'combinedResult': ['Type A', 'Type B']},
 {'combinedResult': ['Wr2 = 0.2833', 'Wr2 = 0.2973']},
 {'combinedResult': ['1.023', '1.037', '1.097']},
 {'combinedResult': ['R1 = 0.0349',
   'R1 = 0.0367',
   'R1 = 0.0374',
   'R1 = 0.0423',
   'R1 = 0.0482',
   'R1 = 0.0492']},
 {'combinedResult': ['Wr2 = 0.1271', 'Wr2 = 0.1311']},
 {'combinedResult': ['Wr2 = 0.0672', 'Wr2 = 0.0879']},
 {'combinedResult': ['Wr2 = 0.0706', 'Wr2 = 0.0916']}]

The candidate groups shown above are all incorrect, sadly. In truth, we want all these entities to be distinct. This presents a clear problem with this form of entity resolution. Since we have a set of predefined node types, we can look into setting rules/heuristics tailored to each node type.
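As an illustration of what such a rule might look like, the sketch below blocks a merge whenever two names contain different numbers. The function `may_merge` is hypothetical and deliberately narrow: it only vetoes merges, and a pair that passes is not thereby judged to be a duplicate.

```python
import re
from typing import List

# Integers or decimals embedded in a name, e.g. "2" and "0.2833" in "Wr2 = 0.2833"
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(name: str) -> List[str]:
    return NUMBER.findall(name)


def may_merge(a: str, b: str) -> bool:
    """Hypothetical rule: never merge names whose embedded numbers differ."""
    return numbers_in(a) == numbers_in(b)


candidates = [
    ("Compound 1", "Compound 2"),
    ("Zn(3)", "Zn(4)"),
    ("Wr2 = 0.2833", "Wr2 = 0.2973"),
    ("Framework", "Framework 2"),
    ("Urotropin", "Urotropine"),
]
for left, right in candidates:
    print(f"{left!r} / {right!r}: blocked={not may_merge(left, right)}")
```

A rule like this would reject the numeric candidates in the list above before they reach the LLM, although pairs such as 'N Atom' and 'O Atoms' would still need a different check.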

For now, we will continue with this approach and use an LLM to make the final decision about whether or not entities should be merged.

In [25]:
#this prompt can be more MOF specific
#could feed in the descriptions to help the LLM decide?
system_prompt = """You are a data processing assistant. Your task is to identify duplicate entities in a list and decide which of them should be merged.
The entities might be slightly different in format or content, but essentially refer to the same thing. Use your analytical skills to determine duplicates.

Here are the rules for identifying duplicates:
1. Entities with minor typographical differences should be considered duplicates.
2. Entities with different formats but the same content should be considered duplicates.
3. Entities that refer to the same real-world object or concept, even if described differently, should be considered duplicates.
4. If it refers to different numbers, dates, or products, do not merge results
"""
user_template = """
Here is the list of entities to process:
{entities}

Please identify duplicates, merge them, and provide the merged list.
"""
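The comment in the cell above asks whether descriptions could help the LLM decide. One possible sketch is to render each entity together with its description before filling the `{entities}` placeholder. The helper `format_entities` and the placeholder descriptions are assumptions for illustration; real descriptions would come from the `description` property in the graph.

```python
from typing import Dict, List, Optional


def format_entities(entities: List[str], descriptions: Dict[str, Optional[str]]) -> str:
    """Render entities as a bullet list, appending a description when one exists."""
    lines = []
    for name in entities:
        description = descriptions.get(name)
        lines.append(f"- {name}: {description}" if description else f"- {name}")
    return "\n".join(lines)


# Placeholder descriptions, not taken from the extracted graph
example_descriptions = {"Compound 1": "placeholder description A", "Compound 2": None}
entity_block = format_entities(["Compound 1", "Compound 2"], example_descriptions)
print(entity_block)
```

The resulting string could be passed as the `entities` value of the prompt. Since only a minority of nodes have descriptions, the fallback to the bare name matters.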

In [26]:
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate
from retry import retry

class DuplicateEntities(BaseModel):
    entities: List[str] = Field(
        description="Entities that represent the same object or real-world entity and should be merged"
    )


class Disambiguate(BaseModel):
    merge_entities: Optional[List[DuplicateEntities]] = Field(
        description="Lists of entities that represent the same object or real-world entity and should be merged"
    )


extraction_llm = ChatOpenAI(model_name="gpt-4o").with_structured_output(
    Disambiguate
)

extraction_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            system_prompt,
        ),
        (
            "human",
            user_template,
        ),
    ]
)

In [27]:
extraction_chain = extraction_prompt | extraction_llm


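def entity_resolution(entities: List[str]) -> List[List[str]]:
    result = extraction_chain.invoke({"entities": entities})
    # merge_entities is Optional, so fall back to an empty list when the LLM proposes no merges
    return [el.entities for el in (result.merge_entities or [])]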

In [31]:
from tqdm import tqdm
import time

merged_entities = []

# Process each element sequentially without parallelization
for el in tqdm(potential_duplicate_candidates, desc="Processing documents"):
    # Call entity_resolution function directly
    to_merge = entity_resolution(el['combinedResult'])
    if to_merge:
        merged_entities.extend(to_merge)


Processing documents:   0%|          | 0/14 [00:00<?, ?it/s]

Processing documents: 100%|██████████| 14/14 [00:06<00:00,  2.14it/s]


In [32]:
merged_entities

[['Urotropin', 'Urotropine'],
 ['Zn(3)', 'Zn(4)'],
 ['Zn4(Dmf)(Ur)2(Ndc)4', '[Zn4(Dmf)(Ur)2(Ndc)4]'],
 ['Framework', 'Framework 2'],
 ['Guest Molecule', 'Guest Molecules']]

Take the results from entity_resolution and merge them back into the database:

In [38]:
graph.query("""
UNWIND $data AS candidates
CALL {
  WITH candidates
  MATCH (e:__Entity__) WHERE e.id IN candidates
  RETURN collect(e) AS nodes
}
CALL apoc.refactor.mergeNodes(nodes, {properties: {
    `.*`: 'discard'
}})
YIELD node
RETURN count(*)
""", params={"data": merged_entities})

[{'count(*)': 5}]

In [39]:
G.drop()

Unnamed: 0,graphName,database,databaseLocation,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema,schemaWithOrientation
