<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/neo4j/Neo4j_ML_Training_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training methods

Node Classification Pipelines, Node Regression Pipelines, and Link Prediction Pipelines are trained using supervised machine learning methods. These methods have several hyperparameters that one can set to influence the training. The objective of this page is to give a brief overview of the methods, as well as advice on how to tune their hyperparameters.

For instructions on how to add model candidates, see the sections Adding model candidates (Node Classification), Adding model candidates (Node Regression), and Adding model candidates (Link Prediction). During training, auto-tuning is carried out to select the best candidate and the best values for its hyperparameters.

The training methods currently supported in the Neo4j Graph Data Science library are:

## Classification (Beta)

- Logistic regression https://neo4j.com/docs/graph-data-science/current/machine-learning/training-methods/logistic-regression/

- Random forest https://neo4j.com/docs/graph-data-science/current/machine-learning/training-methods/random-forest/

## Classification (Alpha)

- Multilayer Perceptron  https://neo4j.com/docs/graph-data-science/current/machine-learning/training-methods/mlp/

## Regression (Alpha)

- Random forest https://neo4j.com/docs/graph-data-science/current/machine-learning/training-methods/random-forest/

- Linear regression https://neo4j.com/docs/graph-data-science/current/machine-learning/training-methods/linear-regression/



# Node classification pipelines

https://neo4j.com/docs/graph-data-science/current/machine-learning/node-property-prediction/nodeclassification-pipelines/node-classification/

Node Classification is a common machine learning task applied to graphs: training models to classify nodes. Concretely, Node Classification models predict the class of an unlabeled node, stored as a node property, based on other node properties. During training, the property representing the class of the node is referred to as the target property. GDS supports both binary and multi-class node classification.


In GDS, we have Node Classification pipelines which offer an end-to-end workflow, from feature extraction to node classification. The training pipelines reside in the pipeline catalog. When a training pipeline is executed, a classification model is created and stored in the model catalog.

A training pipeline is a sequence of two phases:

- The graph is augmented with new node properties in a series of steps.

- The augmented graph is used for training a node classification model.

One can configure which node property steps are included in the first phase. These steps execute GDS algorithms that create new node properties. After configuring the node property steps, one can select a subset of node properties to be used as features. The second phase, training, evaluates multiple model candidates using cross-validation, selects the best one, and reports relevant performance metrics.

After training the pipeline, a classification model is created. This model includes the node property steps and feature configuration from the training pipeline and uses them to generate the relevant features for classifying unlabeled nodes. The classification model can be applied to predict the class of previously unseen nodes. In addition to the predicted class for each node, the predicted probability for each class may also be retained on the nodes. The order of the probabilities matches the order of the classes registered in the model.

In [1]:
%pip install graphdatascience


In [2]:
import os
from graphdatascience import GraphDataScience


In [55]:
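# Get Neo4j DB URI and credentials from the environment if set,
# otherwise fall back to the sandbox values used in this notebook
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://44.204.192.158:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "kills-man-labels")
NEO4J_AUTH = (NEO4J_USER, NEO4J_PASSWORD)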


gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

In [56]:
import json
import numpy as np
import pandas as pd

## Loading the Cora dataset

The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.


https://graphsandnetworks.com/the-cora-dataset/
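
As a small illustration of the 0/1 word-vector encoding described above, the sketch below builds such a vector for a made-up five-word dictionary. The dictionary, the sample text, and the `to_binary_vector` helper are hypothetical; the real Cora dictionary has 1433 entries and the dataset ships already encoded.

```python
# Hypothetical five-word dictionary; the real Cora dictionary has 1433 words.
dictionary = ["graph", "neural", "network", "bayesian", "genetic"]


def to_binary_vector(text, dictionary):
    """Return 1 for each dictionary word present in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in dictionary]


vector = to_binary_vector("A neural network for graph data", dictionary)
print(vector)
```

Each position in the vector corresponds to one dictionary word, so every publication gets a vector of the same length regardless of how long its text is.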

In [103]:
CORA_CONTENT = "https://data.neo4j.com/cora/cora.content"
CORA_CITES = "https://data.neo4j.com/cora/cora.cites"
SUBJECT_TO_ID = {
    "Neural_Networks": 100,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6,
}

HOLDOUT_NODES = 10

In [104]:
# Define a string representation of the SUBJECT_TO_ID map using backticks
subject_map = json.dumps(SUBJECT_TO_ID).replace('"', "`")

# Cypher command to load the nodes using `LOAD CSV`, taking care of
# converting the string `subject` field into an integer and
# replacing the node label for the holdout nodes
load_nodes = f"""
    LOAD CSV FROM "{CORA_CONTENT}" AS row
    WITH
      {subject_map} AS subject_to_id,
      toInteger(row[0]) AS extId,
      row[1] AS subject,
      toIntegerList(row[2..]) AS features
    MERGE (p:Paper {{extId: extId, subject: subject_to_id[subject], features: features}})
    WITH p LIMIT {HOLDOUT_NODES}
    REMOVE p:Paper
    SET p:UnclassifiedPaper
"""

# Cypher command to load the relationships using `LOAD CSV`
load_relationships = f"""
    LOAD CSV FROM "{CORA_CITES}" AS row
    MATCH (n), (m)
    WHERE n.extId = toInteger(row[0]) AND m.extId = toInteger(row[1])
    MERGE (n)-[:CITES]->(m)
"""

# Load nodes and relationships on Neo4j
gds.run_cypher(load_nodes)
gds.run_cypher(load_relationships)
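
The `subject_map` trick used above can be checked in isolation. `json.dumps` quotes keys with double quotes, which Cypher map literals do not accept as keys, so the quotes are swapped for backticks. The sketch uses a shortened, two-entry version of the mapping purely for readability.

```python
import json

# Two entries are enough to show the transformation.
subject_to_id = {"Neural_Networks": 100, "Rule_Learning": 1}

as_json = json.dumps(subject_to_id)
as_cypher_map = as_json.replace('"', "`")

print(as_json)
print(as_cypher_map)
```

The second printed string can be embedded directly in a Cypher query as a map literal, which is what the `WITH {subject_map} AS subject_to_id` line relies on.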

In [105]:
# Create the projected graph containing both classified and unclassified nodes
G, _ = gds.graph.project(
    "cora-graph",
    {"Paper": {"properties": ["features", "subject"]}, "UnclassifiedPaper": {"properties": ["features"]}},
    {"CITES": {"orientation": "UNDIRECTED", "aggregation": "SINGLE"}},
)

In [115]:
assert G.node_count() == 2708
assert G.relationship_count() == 10556
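
The relationship count asserted above is even because an `UNDIRECTED` projection represents each relationship in both directions, while `SINGLE` aggregation keeps one relationship per pair of nodes. The sketch below imitates that counting on a toy edge list; it illustrates the arithmetic only and says nothing about how GDS stores the graph internally.

```python
# Toy directed citations, including a reverse pair (1, 2)/(2, 1)
# and a duplicate (3, 4).
edges = [(1, 2), (2, 1), (3, 4), (3, 4), (4, 5)]

# Treat each edge as an unordered pair and keep one copy per pair.
unique_pairs = {frozenset(edge) for edge in edges}

# Each undirected pair is counted once per direction.
undirected_relationship_count = 2 * len(unique_pairs)
print(undirected_relationship_count)
```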

In [116]:
# Create the pipeline
node_pipeline, result = gds.beta.pipeline.nodeClassification.create("cora-pipeline")

In [114]:
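# Only needed to remove a pipeline left over from an earlier run;
# if you drop it, re-run the create cell above before continuing
gds.beta.pipeline.drop(node_pipeline)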

pipelineInfo    {'splitConfig': {'testFraction': 0.2, 'validat...
pipelineName                                        cora-pipeline
pipelineType                Node classification training pipeline
creationTime                  2024-04-19T19:22:48.779424049+00:00
Name: 0, dtype: object

In [117]:
result

name                                                     cora-pipeline
nodePropertySteps                                                   []
featureProperties                                                   []
splitConfig                {'testFraction': 0.3, 'validationFolds': 3}
autoTuningConfig                                     {'maxTrials': 10}
parameterSpace       {'MultilayerPerceptron': [], 'RandomForest': [...
Name: 0, dtype: object

In [118]:
node_pipeline

NCTrainingPipeline({'pipelineInfo': {0: {'splitConfig': {'testFraction': 0.3, 'validationFolds': 3}, 'autoTuningConfig': {'maxTrials': 10}, 'featurePipeline': {'featureProperties': [], 'nodePropertySteps': []}, 'trainingParameterSpace': {'MultilayerPerceptron': [], 'RandomForest': [], 'LogisticRegression': []}}}, 'pipelineName': {0: 'cora-pipeline'}, 'pipelineType': {0: 'Node classification training pipeline'}, 'creationTime': {0: neo4j.time.DateTime(2024, 4, 19, 19, 26, 55, 805445755, tzinfo=<UTC>)}})

In [110]:
# List all pipelines
gds.beta.pipeline.list()

Unnamed: 0,pipelineInfo,pipelineName,pipelineType,creationTime
0,"{'splitConfig': {'testFraction': 0.3, 'validat...",cora-pipeline,Node classification training pipeline,2024-04-19T19:22:48.779424049+00:00


In [111]:
# Alternatively, get the details of a specific pipeline object
gds.beta.pipeline.list(node_pipeline)

Unnamed: 0,pipelineInfo,pipelineName,pipelineType,creationTime
0,"{'splitConfig': {'testFraction': 0.3, 'validat...",cora-pipeline,Node classification training pipeline,2024-04-19T19:22:48.779424049+00:00


# Auto-tuning
https://neo4j.com/docs/graph-data-science/current/machine-learning/auto-tuning/
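
The idea of trying several concrete hyperparameter values drawn from a range and keeping the best-scoring one can be sketched in plain Python. This is a generic random-search illustration, not the GDS auto-tuning implementation: the uniform sampling, the `sample_candidate` helper, and the `hypothetical_score` function are all assumptions made for the example.

```python
import random


def sample_candidate(space, rng):
    """Turn a candidate with ranges into one with concrete values."""
    concrete = {}
    for name, value in space.items():
        if isinstance(value, tuple):
            low, high = value
            # Assumption: sample uniformly from the range.
            concrete[name] = rng.uniform(low, high)
        else:
            concrete[name] = value
    return concrete


def hypothetical_score(params):
    """Stand-in for a cross-validated metric; peaks at penalty = 0.1."""
    return 1.0 - abs(params["penalty"] - 0.1)


rng = random.Random(42)
space = {"maxEpochs": 200, "penalty": (0.0, 0.5)}
max_trials = 5

# Draw one concrete configuration per trial and keep the best-scoring one.
trials = [sample_candidate(space, rng) for _ in range(max_trials)]
best = max(trials, key=hypothetical_score)
print(best)
```

In the pipeline, the score would come from cross-validation on the training set rather than from a closed-form function, and `maxTrials` bounds how many concrete configurations are evaluated.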

In [119]:
# "Mark" some node properties that will be used as features
node_pipeline.selectFeatures(["features"])

# If needed, change the train/test split ratio and the number of folds
# for k-fold cross-validation
node_pipeline.configureSplit(testFraction=0.2, validationFolds=5)

# Add a model candidate to train
node_pipeline.addLogisticRegression(maxEpochs=200, penalty=(0.0, 0.5))
#node_pipeline.addRandomForest(maxDepth=3)

# Explicitly set the number of trials for autotuning (default = 10)
node_pipeline.configureAutoTuning(maxTrials=5)

name                                                     cora-pipeline
nodePropertySteps                                                   []
featureProperties                                           [features]
splitConfig                {'testFraction': 0.2, 'validationFolds': 5}
autoTuningConfig                                      {'maxTrials': 5}
parameterSpace       {'MultilayerPerceptron': [], 'RandomForest': [...
Name: 0, dtype: object
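
To make the split configuration concrete, the following sketch holds out a test set and partitions the remaining nodes into folds, using 100 toy node ids. It is a simplified illustration under stated assumptions: the `split_nodes` helper is hypothetical, the test-set size is rounded down, and no attempt is made to reproduce how GDS rounds, shuffles, or balances classes.

```python
import random


def split_nodes(node_ids, test_fraction, validation_folds, seed):
    """Hold out a test set, then partition the rest into folds."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    test_size = int(len(ids) * test_fraction)  # assumption: round down
    test, train = ids[:test_size], ids[test_size:]
    # Fold i takes every validation_folds-th node starting at offset i.
    folds = [train[i::validation_folds] for i in range(validation_folds)]
    return test, folds


test, folds = split_nodes(range(100), test_fraction=0.2, validation_folds=5, seed=42)
print(len(test), [len(fold) for fold in folds])
```

In each round of cross-validation, one fold serves as the validation set and the remaining folds are used for fitting; the held-out test set is only used to score the winning candidate.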

In [120]:
# Estimate the resources needed for training the model
node_pipeline.train_estimate(
    G,
    targetNodeLabels=["Paper"],
    modelName="cora-pipeline-model",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4,
)

requiredMemory                                     [64 MiB ... 64 MiB]
treeView             Memory Estimation: [64 MiB ... 64 MiB]\n|-- al...
mapView              {'memoryUsage': '[64 MiB ... 64 MiB]', 'name':...
bytesMin                                                      67130384
bytesMax                                                      67162344
nodeCount                                                         2698
relationshipCount                                                10502
heapPercentageMin                                                  0.1
heapPercentageMax                                                  0.1
Name: 0, dtype: object

# Metrics

Supported metrics: OUT_OF_BAG_ERROR, F1_MACRO, ACCURACY, F1_WEIGHTED, RECALL(class=*), RECALL(class=<class value>), ACCURACY(class=*), ACCURACY(class=<class value>), F1(class=*), F1(class=<class value>), PRECISION(class=*), PRECISION(class=<class value>). OUT_OF_BAG_ERROR applies only to random forest candidates.
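
For readers unfamiliar with the difference between F1_MACRO and F1_WEIGHTED, the sketch below computes both on a tiny made-up label set using the standard textbook definitions. The `f1_per_class` helper and the labels are hypothetical, and the GDS documentation remains the authority on exactly how the library computes its metrics.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {class: F1} computed from per-class precision and recall."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[cls] = 2 * precision * recall / total if total else 0.0
    return scores


# Made-up labels: class 0 is twice as frequent as class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

per_class = f1_per_class(y_true, y_pred)
support = Counter(y_true)

# Macro: plain average over classes. Weighted: average weighted by class frequency.
f1_macro = sum(per_class.values()) / len(per_class)
f1_weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
print(f1_macro, f1_weighted)
```

The weighted variant gives frequent classes more influence, which is why it is a common choice when class sizes are uneven.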

In [84]:
# model.drop()  # uncomment if a model with this name already exists

Unnamed: 0,modelName,modelType,modelInfo,creationTime,trainConfig,graphSchema,loaded,stored,published,shared


In [121]:
# Perform the actual training
model, stats = node_pipeline.train(
    G,
    targetNodeLabels=["Paper"],
    modelName="cora-pipeline-model",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4,
)

Node Classification Train Pipeline:   0%|          | 0/100 [00:00<?, ?%/s]

In [18]:
#print(stats.to_json(indent=2))

In [122]:
# Print the F1_WEIGHTED test score
# (logistic regression scored 0.7255989924538443 in an earlier run)
stats["modelInfo"]["metrics"]["F1_WEIGHTED"]["test"]

0.7255989924538444

## Using the model for prediction

In [123]:
predicted = model.predict_stream(
    G, modelName="cora-pipeline-model", includePredictedProbabilities=True, targetNodeLabels=["UnclassifiedPaper"]
)

In [124]:
predicted

Unnamed: 0,nodeId,predictedClass,predictedProbabilities
0,0,100,"[0.07101325099624993, 0.027586109083241425, 0...."
1,1,5,"[0.1029856949249486, 0.048235215107684866, 0.1..."
2,2,2,"[0.03866886076077883, 0.4728048870571118, 0.04..."
3,3,2,"[0.031489446893675534, 0.7106977819083203, 0.1..."
4,4,3,"[0.04998359412651768, 0.021991878791378778, 0...."
5,5,5,"[0.16408655879929296, 0.1728289555341089, 0.10..."
6,6,6,"[0.05558324252234833, 0.11404625254866822, 0.0..."
7,7,100,"[0.035468515160710215, 0.06123368878990283, 0...."
8,8,100,"[0.07281324940781957, 0.09634682929676182, 0.2..."
9,9,4,"[0.05529402843075868, 0.16560916173235118, 0.0..."


In [125]:
# List of class labels
classes = stats["modelInfo"]["classes"]
print("Class labels:", classes)

# Calculate the confidence percentage for the predicted class
predicted["confidence"] = predicted.apply(
    lambda row: np.floor(row["predictedProbabilities"][classes.index(row["predictedClass"])] * 100), axis=1
)

predicted

Class labels: [1, 2, 3, 4, 5, 6, 100]


Unnamed: 0,nodeId,predictedClass,predictedProbabilities,confidence
0,0,100,"[0.07101325099624993, 0.027586109083241425, 0....",43.0
1,1,5,"[0.1029856949249486, 0.048235215107684866, 0.1...",20.0
2,2,2,"[0.03866886076077883, 0.4728048870571118, 0.04...",47.0
3,3,2,"[0.031489446893675534, 0.7106977819083203, 0.1...",71.0
4,4,3,"[0.04998359412651768, 0.021991878791378778, 0....",61.0
5,5,5,"[0.16408655879929296, 0.1728289555341089, 0.10...",18.0
6,6,6,"[0.05558324252234833, 0.11404625254866822, 0.0...",49.0
7,7,100,"[0.035468515160710215, 0.06123368878990283, 0....",48.0
8,8,100,"[0.07281324940781957, 0.09634682929676182, 0.2...",28.0
9,9,4,"[0.05529402843075868, 0.16560916173235118, 0.0...",48.0


## Adding a data preprocessing step

The quality of the model can potentially be increased by adding more features or by using different features altogether. One way is to use algorithms such as FastRP that create embeddings based on both node properties and graph features, which can be added via the addNodeProperty pipeline method. Such properties are "transient", in that they are automatically created and removed by the pipeline itself.

In this example we also use the contextNodeLabels parameter to explicitly set the types of nodes we calculate the embeddings for, and we include both the classified and the unclassified nodes.
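
To give an intuition for how an embedding can combine node properties with graph structure, the sketch below projects node features through a sparse random matrix and then repeatedly averages each node's vector over its neighbors. It is only loosely inspired by random-projection embeddings and does not reproduce the GDS FastRP algorithm: the `toy_embedding` helper, the projection probabilities, and the absence of normalization and iteration weights are all simplifying assumptions.

```python
import numpy as np


def toy_embedding(adjacency, features, dimension, iterations, seed):
    """Random projection of features followed by repeated neighbor averaging."""
    rng = np.random.default_rng(seed)
    # Sparse random projection matrix with entries in {-1, 0, 1}.
    projection = rng.choice(
        [-1.0, 0.0, 1.0],
        size=(features.shape[1], dimension),
        p=[1 / 6, 2 / 3, 1 / 6],
    )
    current = features @ projection
    # Row-normalize the adjacency matrix so each step averages over neighbors.
    degrees = adjacency.sum(axis=1, keepdims=True)
    averaging = adjacency / np.maximum(degrees, 1)
    embedding = np.zeros_like(current)
    for _ in range(iterations):
        current = averaging @ current
        embedding += current
    return embedding


# Path graph 0 - 1 - 2 with three binary features per node.
adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
features = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
embedding = toy_embedding(adjacency, features, dimension=4, iterations=2, seed=42)
print(embedding.shape)
```

In this toy graph the two end nodes have the same neighborhood, so they receive identical vectors even though their own features differ, which shows how the structure shapes the result.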

In [126]:
node_pipeline_fastrp, _ = gds.beta.pipeline.nodeClassification.create("cora-pipeline-fastrp")

# Add a step in the pipeline that mutates the graph
node_pipeline_fastrp.addNodeProperty(
    "fastRP",
    mutateProperty="embedding",
    embeddingDimension=512,
    propertyRatio=1.0,
    randomSeed=42,
    featureProperties=["features"],
    contextNodeLabels=["Paper", "UnclassifiedPaper"],
)

# With the node embeddings available as features, we no longer use the original raw `features`.
node_pipeline_fastrp.selectFeatures(["embedding"])

# Configure the pipeline as before
node_pipeline_fastrp.configureSplit(testFraction=0.2, validationFolds=5)
node_pipeline_fastrp.addLogisticRegression(maxEpochs=200, penalty=(0.0, 0.5))
node_pipeline_fastrp.configureAutoTuning(maxTrials=5)

In [127]:
# Perform the actual training
model_fastrp, stats_fastrp = node_pipeline_fastrp.train(
    G,
    targetNodeLabels=["Paper"],
    modelName="cora-pipeline-model-fastrp",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4,
)

Node Classification Train Pipeline:   0%|          | 0/100 [00:00<?, ?%/s]

In [128]:
print(stats_fastrp["modelInfo"]["metrics"]["F1_WEIGHTED"]["test"])

0.8323028609950918


In [129]:
predicted_fastrp = model_fastrp.predict_stream(
    G,
    modelName="cora-pipeline-model-fastrp",
    includePredictedProbabilities=True,
    targetNodeLabels=["UnclassifiedPaper"],
)

Node Classification Predict Pipeline:   0%|          | 0/100 [00:00<?, ?%/s]

In [130]:
predicted_fastrp


Unnamed: 0,nodeId,predictedClass,predictedProbabilities
0,0,100,"[0.05499660223163993, 0.061287374013405144, 0...."
1,1,1,"[0.49739902155116716, 0.05438138894952869, 0.0..."
2,2,2,"[0.030723800132530454, 0.7325261897220301, 0.0..."
3,3,2,"[0.039537239277303966, 0.5019245722042818, 0.0..."
4,4,3,"[0.01878290590790461, 0.030478214956835798, 0...."
5,5,3,"[0.028792159042480527, 0.12487238779892261, 0...."
6,6,4,"[0.04493768578261787, 0.21175407085175235, 0.1..."
7,7,100,"[0.056839542198208026, 0.0643912886489185, 0.2..."
8,8,100,"[0.05245380735413178, 0.06048191514973327, 0.2..."
9,9,4,"[0.041905105715991445, 0.07435200584890143, 0...."


In [131]:
model_fastrp.predict_mutate(
    G,
    mutateProperty="predictedClass",
    modelName="cora-pipeline-model-fastrp",
    predictedProbabilityProperty="predictedProbabilities",
    targetNodeLabels=["UnclassifiedPaper"],
)

predicted_fastrp = gds.graph.nodeProperty.stream(G, "predictedClass", ["UnclassifiedPaper"])

Node Classification Predict Pipeline:   0%|          | 0/100 [00:00<?, ?%/s]

In [132]:
predicted_fastrp

Unnamed: 0,nodeId,propertyValue,nodeLabels
0,0,100,[]
1,1,1,[]
2,2,2,[]
3,3,2,[]
4,4,3,[]
5,5,3,[]
6,6,4,[]
7,7,100,[]
8,8,100,[]
9,9,4,[]


In [133]:
# Retrieve node information from Neo4j using the node IDs from the prediction result
nodes = gds.util.asNodes(predicted_fastrp.nodeId.to_list())

# Create a new DataFrame containing node IDs along with node properties
nodes_df = pd.DataFrame([(node.id, node["subject"]) for node in nodes], columns=["nodeId", "subject"])

# Merge with the prediction result on node IDs, to check the predicted value
# against the original subject
#
# NOTE: This could also be replaced by just appending `node["subject"]` as a
# Series since the node order would not change, but a proper merge (or join)
# is clearer and less prone to errors.
predicted_fastrp.merge(nodes_df, on="nodeId")



Unnamed: 0,nodeId,propertyValue,nodeLabels,subject
0,0,100,[],100
1,1,1,[],1
2,2,2,[],2
3,3,2,[],2
4,4,3,[],3
5,5,3,[],3
6,6,4,[],4
7,7,100,[],100
8,8,100,[],100
9,9,4,[],4
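
The merged table can be summarized as a single agreement score. The sketch below rebuilds the table from the values shown in the output above and computes the fraction of holdout nodes whose predicted class matches the original subject; the `holdout_accuracy` name is introduced here for illustration.

```python
import pandas as pd

# Values copied from the merged output above.
merged = pd.DataFrame(
    {
        "nodeId": list(range(10)),
        "propertyValue": [100, 1, 2, 2, 3, 3, 4, 100, 100, 4],
        "subject": [100, 1, 2, 2, 3, 3, 4, 100, 100, 4],
    }
)

# Fraction of holdout nodes whose predicted class equals the original subject.
holdout_accuracy = (merged["propertyValue"] == merged["subject"]).mean()
print(holdout_accuracy)
```

With only ten holdout nodes this is a sanity check rather than a reliable estimate; the F1_WEIGHTED test score reported by the pipeline is the better indicator of model quality.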


In [134]:
gds.graph.nodeProperties.write(
    G,
    node_properties=["predictedClass"],
    node_labels=["UnclassifiedPaper"],
)

writeMillis                         5
graphName                  cora-graph
nodeProperties       [predictedClass]
propertiesWritten                  10
Name: 0, dtype: object

In [135]:
model.drop()
model_fastrp.drop()
node_pipeline.drop()
node_pipeline_fastrp.drop()

G.drop()

graphName                                                       cora-graph
database                                                             neo4j
databaseLocation                                                     local
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                             2708
relationshipCount                                                    10556
configuration            {'relationshipProjection': {'CITES': {'aggrega...
density                                                            0.00144
creationTime                           2024-04-19T19:22:10.520025911+00:00
modificationTime                       2024-04-19T19:40:41.462416864+00:00
schema                   {'graphProperties': {}, 'nodes': {'Paper': {'s...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'Paper': {'s...
Name: 0, dtype: object

In [136]:
gds.run_cypher("MATCH (n) WHERE n:Paper OR n:UnclassifiedPaper DETACH DELETE n")

In [137]:
gds.close()