# 06.2 - Knowledge Graph creation with LangChain

This notebook demonstrates how to automatically create a knowledge graph from research paper abstracts using LangChain and LLMs.

## 01 - Setup

Ensure the necessary modules are installed and up to date.

%pip install --upgrade --quiet  langchain langchain-community langchain-experimental langchain-ollama json_repair

Local LLMs are accessed via Ollama in this notebook. Ollama can be installed from https://ollama.com/.

## 02 - Data Preparation

To create a knowledge graph with LangChain, we first load the research papers into LangChain `Document` objects.

In this step, we can load either the MIDAS papers that our LLM has classified as modeling papers (5700+ documents) or the training papers from the MIDAS dataset (46 documents). The smaller training set is recommended as a starting point.

Load the classified papers (5700+ documents):

In [1]:
import pandas as pd
from langchain_core.documents import Document

df_modeling_papers = pd.read_json("../data/modeling_papers_0.json", orient="records", lines=True)

documents = []

for row in df_modeling_papers.itertuples():
    documents.append(Document(id=row.id, page_content=row.abstract))

f"Papers loaded: {len(documents)}"

'Papers loaded: 5737'

Alternatively, load the training papers (46 documents):

In [2]:
import json
from langchain_core.documents import Document

with open("../data/training_modeling_papers.json", "r") as f:
    data = json.load(f)

training_documents = []

for row in data:
    training_documents.append(Document(page_content=row["abstract"]))

# Report the size of the training set, not of the larger `documents` list
f"Papers loaded: {len(training_documents)}"

'Papers loaded: 46'
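
If some records lack an abstract, the resulting `Document` would give the LLM nothing to extract. The following is a minimal sketch of filtering such records before building documents. The sample records and the `clean_records` helper are hypothetical, and plain dictionaries stand in for the loaded JSON rows.

```python
def clean_records(records: list[dict]) -> list[dict]:
    """Keep only records whose abstract is a non-empty string."""
    cleaned = []
    for record in records:
        abstract = record.get("abstract")
        if isinstance(abstract, str) and abstract.strip():
            # Store the abstract with surrounding whitespace removed
            cleaned.append({**record, "abstract": abstract.strip()})
    return cleaned


# Hypothetical rows shaped like the entries in the JSON files above.
sample_records = [
    {"id": "a1", "abstract": "  An SEIR model of influenza transmission.  "},
    {"id": "a2", "abstract": ""},
    {"id": "a3", "abstract": None},
]

usable = clean_records(sample_records)
print(f"Usable records: {len(usable)} of {len(sample_records)}")
```

The surviving rows could then be passed to `Document(page_content=row["abstract"])` in the same way as in the cells above.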

## 03 - Create a Knowledge Graph

LangChain provides direct, experimental support for creating knowledge graphs from documents using LLMs. This is done by using the `LLMGraphTransformer` class.

Documentation on using LLMGraphTransformer can be found at:
https://python.langchain.com/docs/how_to/graph_constructing/

The prompt that LLMGraphTransformer uses to identify graph entities and relationships can be found at:
https://python.langchain.com/api_reference/_modules/langchain_experimental/graph_transformers/llm.html#create_unstructured_prompt

Import the necessary modules so they can be used later:

In [3]:
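from langchain_community.graphs.graph_document import GraphDocument
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_ollama.llms import OllamaLLM
from langchain_community.llms.llamafile import Llamafile

# The transformer returns GraphDocument objects (nodes, relationships, source)
def print_graph_results(graph_documents: list[GraphDocument]) -> None: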
    for doc in graph_documents:
        if len(doc.nodes) > 0:
            print(f"Paper ID: {doc.source.id}")
            print(f"Paper Abstract: {doc.source.page_content}")

            for node in doc.nodes:
                print(node)
                print(f"Node: {node.id}, Type: {node.type}")

            for rel in doc.relationships:
                print(f"Relationship: {rel.type}")
                print(f"   Source: {rel.source.id}, Type: {rel.source.type}")
                print(f"   Target: {rel.target.id}, Type: {rel.target.type}")

            print()
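
When more than a handful of documents are processed, printing every node becomes hard to read. A small tally of node types can help compare runs. This is a sketch only: `count_node_types` is a hypothetical helper, and `SimpleNamespace` objects stand in for graph documents so the example runs without a model. It assumes only the `nodes` and `type` attributes already used by `print_graph_results`.

```python
from collections import Counter
from types import SimpleNamespace


def count_node_types(graph_documents) -> Counter:
    """Tally node types across objects that expose a `nodes` list."""
    counts = Counter()
    for doc in graph_documents:
        counts.update(node.type for node in doc.nodes)
    return counts


# Stand-in objects with the same `nodes` / `type` attributes used by
# print_graph_results; real graph documents are not needed here.
fake_docs = [
    SimpleNamespace(nodes=[
        SimpleNamespace(id="Influenza", type="Disease"),
        SimpleNamespace(id="SEIR", type="Model"),
    ]),
    SimpleNamespace(nodes=[SimpleNamespace(id="Dengue", type="Disease")]),
]

node_type_counts = count_node_types(fake_docs)
print(node_type_counts.most_common())
```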

For the first test, allow the model to infer both node types and relationships. Here we'll use the Gemma 3 12B model.

In [4]:
# "gemma3:12b" is the Ollama tag for Gemma 3 12B ("gemma:latest" refers to the original Gemma)
llm = OllamaLLM(model="gemma3:12b", temperature=0.1)
#llm = Llamafile()
transformer = LLMGraphTransformer(llm=llm)


# Process a single document for testing
graph_documents = transformer.convert_to_graph_documents(documents[:1])

print_graph_results(graph_documents)

Paper ID: 37227da2b75373b500a6a9f24649dcec
Paper Abstract: Many applications in science and engineering involve data defined at specific geospatial locations, which are often modeled as random fields. The modeling of a proper correlation function is essential for the probabilistic calibration of the random fields, but traditional methods were developed with the assumption to have observations with evenly spaced data. Available methods dealing with irregularly spaced data generally require either interpolation or computationally expensive solutions. Instead, we propose a simple approach based on least square regression to estimate the autocorrelation function. We first tested our methodology on an artificially produced dataset to assess the performance of our method. The accuracy of the method and its robustness to the level of noise in the data indicate that it is suitable for use in realistic problems. In addition, the methodology was used on a major application, the modeling of anima

Next, create a knowledge graph with specific node types.

In [None]:
transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=[
        "Disease Modeling Goal",
        "Disease Modeling Technique",
        "Disease Model Data Requirement",
        "Disease Modeled",
        "Geographic Location",
    ],
)

# Process a single document for testing
graph_documents = transformer.convert_to_graph_documents(documents[:1])

print_graph_results(graph_documents)

Different models produce different results. The Mistral Small 3.1 model should produce better results than the Gemma 3 model; however, it requires more memory (~15GB at 4-bit quantization).

In [None]:
llm = OllamaLLM(model="mistral-small:latest", temperature=0.15)

transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=[
        "Disease Modeling Goal",
        "Disease Modeling Technique",
        "Disease Model Data Requirement",
        "Disease Modeled",
        "Geographic Location",
    ],
)

# Process a subset of the documents as a test
graph_documents = transformer.convert_to_graph_documents(documents[:1])

print_graph_results(graph_documents)
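
The memory figure quoted for Mistral Small can be sanity-checked with simple arithmetic: the weights alone need roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below assumes about 24 billion parameters for Mistral Small 3.1; `weight_memory_gb` is a hypothetical helper, and the result covers weights only.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: roughly 24 billion parameters for Mistral Small 3.1.
weights_gb = weight_memory_gb(24e9, 4)
print(f"Weights at 4-bit: ~{weights_gb:.0f} GB")
```

The gap between this estimate and the ~15GB figure would plausibly come from the context cache and runtime overhead, both of which vary with configuration.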

Relationships can be constrained as well. Either the names of the relationships can be specified, in which case the model infers the nodes that these relationships connect, or the allowed subject-predicate-object tuples can be specified.

In [None]:
allowed_relationships = [
    ("Disease Modeled", "LOCATION", "Geographic Location"),
]

transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=[
        "Disease Modeling Goal",
        "Disease Modeling Technique",
        "Disease Model Data Requirement",
        "Disease Modeled",
        "Geographic Location",
    ],
    allowed_relationships=allowed_relationships,
)

# Process a subset of the documents as a test
graph_documents = transformer.convert_to_graph_documents(documents[:1])

print_graph_results(graph_documents)
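
To make the tuple form of `allowed_relationships` concrete, the sketch below shows what a subject-predicate-object constraint permits. It illustrates the intended semantics only and is not the library's internal filtering code; the candidate triples are hypothetical.

```python
# Each tuple reads (source node type, relationship type, target node type).
allowed_triples = {
    ("Disease Modeled", "LOCATION", "Geographic Location"),
}

# Hypothetical extracted relationships in the same tuple form.
candidate_triples = [
    ("Disease Modeled", "LOCATION", "Geographic Location"),
    ("Geographic Location", "LOCATION", "Disease Modeled"),  # wrong direction
    ("Disease Modeled", "USES", "Disease Modeling Technique"),  # not listed
]

# Keep only candidates that match an allowed triple exactly.
kept = [t for t in candidate_triples if t in allowed_triples]
print(kept)
```

Only the first candidate matches, since both the direction and the relationship name are part of the constraint.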

Node and relationship types can also be specified together, here with the Llama 3.2 model:

In [None]:
llm = OllamaLLM(model="llama3.2:latest", temperature=0.15)

allowed_relationships = [
    ("Disease Modeled", "LOCATION", "Geographic Location"),
]

transformer = LLMGraphTransformer(
    llm=llm,
    allowed_nodes=[
        "Disease Modeling Goal",
        "Disease Modeling Technique",
        "Disease Model Data Requirement",
        "Disease Modeled",
        "Geographic Location",
    ],
    allowed_relationships=allowed_relationships,
)

# Process a subset of the documents as a test
graph_documents = transformer.convert_to_graph_documents(documents[:1])

print_graph_results(graph_documents)

In [None]:
from langchain_ollama import ChatOllama

llm = ChatOllama(model="mistral-small:latest", temperature=0.15)

transformer = LLMGraphTransformer(llm=llm)

# Process a subset of the documents as a test
graph_documents = transformer.convert_to_graph_documents(documents[:1])

print_graph_results(graph_documents)
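
The cells above convert a single document at a time. When moving to a larger set, it may help to convert in batches so that one failure does not discard earlier results. The following is a minimal sketch under that assumption: `convert_in_batches` is a hypothetical helper, and `fake_convert` stands in for `transformer.convert_to_graph_documents` so the example runs without a model.

```python
from typing import Callable, Iterable


def convert_in_batches(
    docs: list,
    convert: Callable[[list], Iterable],
    batch_size: int = 10,
) -> tuple[list, list]:
    """Run `convert` on slices of `docs`; collect results and failed slices."""
    results, failed = [], []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        try:
            results.extend(convert(batch))
        except Exception as exc:
            # Record the starting index and error, then continue
            failed.append((start, repr(exc)))
    return results, failed


# Stand-in for transformer.convert_to_graph_documents, used only so the
# sketch runs without a model: it fails on any batch containing "bad".
def fake_convert(batch: list) -> list:
    if "bad" in batch:
        raise ValueError("malformed model output")
    return [item.upper() for item in batch]


converted, failures = convert_in_batches(
    ["a", "b", "bad", "c", "d"], fake_convert, batch_size=2
)
print(converted, failures)
```

With a real transformer, `convert` would be `transformer.convert_to_graph_documents` and `docs` would be the `documents` list; failed batches could then be retried or inspected individually.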