<a href="https://colab.research.google.com/github/neo4j/graph-data-science-client/blob/main/examples/ml-pipelines-node-classification.ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning pipelines: Node classification

This notebook shows the usage of GDS machine learning pipelines with the Python client and the well-known Cora dataset.

## Setup

We need an environment where Neo4j and GDS are available, for example AuraDS (which comes with GDS preinstalled) or Neo4j Desktop. 

Once the credentials to access this environment are available, we can install the `graphdatascience` package and import the client class.

In [None]:
%pip install graphdatascience

In [129]:
from graphdatascience import GraphDataScience

When using a local Neo4j setup, the default connection URI is `bolt://localhost:7687`. When using AuraDS, the connection URI is slightly different as it uses the `neo4j+s` protocol. In this case, the client should also include the `aura_ds=True` flag to enable AuraDS-recommended settings.

In [130]:
# Replace with the actual connection URI and credentials
NEO4J_CONNECTION_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "test"

gds = GraphDataScience(NEO4J_CONNECTION_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# On AuraDS:
#
# NEO4J_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
# NEO4J_USERNAME = "neo4j"
# NEO4J_PASSWORD = "..."
#
# gds = GraphDataScience(NEO4J_CONNECTION_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD), aura_ds=True)

We also need to check that the version of the GDS library is 2.2.0 or newer, because we will use the concept of "context" introduced in [GDS 2.2.0](https://github.com/neo4j/graph-data-science/releases/tag/2.2.0).

In [131]:
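# Compare (major, minor) numerically: as plain strings, "2.10.0" would sort before "2.2.0"
assert tuple(int(part) for part in gds.version().split(".")[:2]) >= (2, 2)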
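
Version strings are safer to compare numerically than lexicographically. The following is a minimal sketch of such a comparison; `parse_version` is a hypothetical helper (not part of the `graphdatascience` client), and it assumes the version string starts with dot-separated integers.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract the leading numeric components of a version string (hypothetical helper)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    # A missing patch component is treated as 0
    return tuple(int(part or 0) for part in match.groups())


# Lexicographic comparison gets this wrong...
print("2.10.0" >= "2.2.0")  # False
# ...while the numeric comparison does not
print(parse_version("2.10.0") >= (2, 2, 0))  # True
```

Any pre-release suffix after the numeric part (for example `-alpha01`) is simply ignored by this helper.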

Finally, we import `json` to help in writing the Cypher queries used to load the data, and `numpy` and `pandas` for further data processing.

In [132]:
import json

import numpy as np
import pandas as pd

## Load the Cora dataset

First of all, we need to load the Cora dataset into Neo4j. The CSV files can be found at the following URIs:

In [133]:
CORA_CONTENT = (
    "https://raw.githubusercontent.com/neo4j/graph-data-science/master/test-utils/src/main/resources/cora.content"
)
CORA_CITES = (
    "https://raw.githubusercontent.com/neo4j/graph-data-science/master/test-utils/src/main/resources/cora.cites"
)

While loading, we also convert the `subject` field (a string in the dataset) into an integer, because node properties must be numerical in order to be projected into a graph. Consecutive IDs would work, but we deliberately assign a non-consecutive ID (100 instead of 0) to the first class to show how the classes are represented in the model.

We also select a number of nodes to be held out to test the model after it has been trained. **NOTE:** This holdout set is unrelated to the train/test split ratio configured in the pipeline.

In [134]:
SUBJECT_TO_ID = {
    "Neural_Networks": 100,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6,
}

HOLDOUT_NODES = 10

We can now load the CSV files using the `LOAD CSV` statement and some basic data transformation:

In [135]:
# Define a string representation of the SUBJECT_TO_ID map using backticks
subject_map = json.dumps(SUBJECT_TO_ID).replace('"', "`")

# Cypher command to load the nodes using `LOAD CSV`, taking care of
# converting the string `subject` field into an integer and
# replacing the node label for the holdout nodes
load_nodes = f"""
    LOAD CSV FROM "{CORA_CONTENT}" AS row
    WITH 
      {subject_map} AS subject_to_id,
      toInteger(row[0]) AS extId, 
      row[1] AS subject, 
      toIntegerList(row[2..]) AS features
    MERGE (p:Paper {{extId: extId, subject: subject_to_id[subject], features: features}})
    WITH p LIMIT {HOLDOUT_NODES}
    REMOVE p:Paper
    SET p:UnclassifiedPaper
"""

# Cypher command to load the relationships using `LOAD CSV`
load_relationships = f"""
    LOAD CSV FROM "{CORA_CITES}" AS row
    MATCH (n), (m) 
    WHERE n.extId = toInteger(row[0]) AND m.extId = toInteger(row[1])
    MERGE (n)-[:CITES]->(m)
"""

# Load nodes and relationships on Neo4j
gds.run_cypher(load_nodes)
gds.run_cypher(load_relationships)
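
To make the string manipulation concrete, this small sketch reproduces how the Python dictionary becomes a Cypher map literal with backtick-quoted keys. It uses a shortened, illustrative dictionary (`example_map`, `cypher_map` and `query` are names introduced here) and needs no database connection.

```python
import json

# Shortened copy of the mapping, for illustration only
example_map = {"Neural_Networks": 100, "Rule_Learning": 1}

# json.dumps wraps the keys in double quotes; replacing them with backticks
# yields a Cypher map literal with backtick-quoted keys
cypher_map = json.dumps(example_map).replace('"', "`")
print(cypher_map)
# {`Neural_Networks`: 100, `Rule_Learning`: 1}

# The literal can then be interpolated into a query string
query = f"WITH {cypher_map} AS subject_to_id RETURN subject_to_id['Rule_Learning'] AS id"
print(query)
```

This works because all the values are integers. String values would also have their quotes replaced by backticks, so that case would need different handling.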

With the data loaded on Neo4j, we can now project a graph including all the nodes and the `CITES` relationship as undirected (and with `SINGLE` aggregation, to skip repeated relationships as a result of adding the inverse direction).

In [149]:
# Create the projected graph containing both classified and unclassified nodes
G, _ = gds.graph.project(
    "cora-graph",
    {"Paper": {"properties": ["features", "subject"]}, "UnclassifiedPaper": {"properties": ["features"]}},
    {"CITES": {"orientation": "UNDIRECTED", "aggregation": "SINGLE"}},
)
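
The effect of the `UNDIRECTED` orientation combined with `SINGLE` aggregation can be illustrated with plain Python. This is only a conceptual sketch on a made-up edge list, not how GDS implements projections: every relationship becomes traversable in both directions, and parallel relationships between the same pair of nodes are collapsed into one.

```python
from collections import Counter

# Made-up directed citations; papers 1 and 2 cite each other
directed_edges = [(1, 2), (2, 1), (1, 3)]

# UNDIRECTED orientation: each relationship is also added in the reverse direction
undirected = directed_edges + [(target, source) for source, target in directed_edges]

# Without aggregation, the mutual citation shows up twice per direction
print(Counter(undirected)[(1, 2)])  # 2

# SINGLE aggregation: keep only one relationship per (source, target) pair
aggregated = sorted(set(undirected))
print(aggregated)  # [(1, 2), (1, 3), (2, 1), (3, 1)]
```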

Then we can check the number of nodes in the newly-projected graph (which should be 2708), just to make sure it has been created correctly:

In [150]:
assert G.node_count() == 2708

## Pipeline catalog basics

Once the dataset has been loaded, we can define a node classification machine learning pipeline.

In [101]:
# Create the pipeline
node_pipeline, _ = gds.beta.pipeline.nodeClassification.create("cora-pipeline")

We can check that the pipeline has actually been created with the `list` method:

In [102]:
# List all pipelines
gds.beta.pipeline.list()

# Get the details of a specific pipeline object
gds.beta.pipeline.list(node_pipeline)

| | pipelineInfo | pipelineName | pipelineType | creationTime |
|---|---|---|---|---|
| 0 | {'featurePipeline': {'nodePropertySteps': [], ... | cora-pipeline | Node classification training pipeline | 2022-12-09T10:53:34.566624000+00:00 |


## Configure the pipeline

We can now configure the pipeline. As a reminder, we need to:

1. Select a subset of the available node properties to be used as features for the machine learning model
1. Configure the train/test split and the number of folds for k-fold cross-validation _(optional)_
1. Configure the candidate models for training

In [None]:
# "Mark" some node properties that will be used as features
node_pipeline.selectFeatures(["features"])

In [None]:
# If needed, change the train/test split ratio and the number of folds
# for k-fold cross-validation
node_pipeline.configureSplit(testFraction=0.2, validationFolds=5)
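
As a rough illustration of what these two settings mean, the sketch below splits a list of node indices into a test set and k validation folds. It is a generic, unstratified split written for this example (`split_nodes` is a hypothetical helper); the exact rounding and any stratification applied by GDS may differ.

```python
import random


def split_nodes(node_count, test_fraction, validation_folds, seed=42):
    """Split node indices into a test set and k folds of the train set."""
    indices = list(range(node_count))
    random.Random(seed).shuffle(indices)

    test_size = int(node_count * test_fraction)
    test, train = indices[:test_size], indices[test_size:]

    # Distribute the train nodes round-robin over the folds
    folds = [train[i::validation_folds] for i in range(validation_folds)]
    return test, train, folds


# 2708 nodes in total, minus the 10 holdout nodes
test, train, folds = split_nodes(2698, test_fraction=0.2, validation_folds=5)
print(len(test), len(train), [len(fold) for fold in folds])
```

In k-fold cross-validation, each fold is used once for validation while the remaining folds are used for training.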

Here we use Logistic Regression as an example for the training, but other algorithms (such as Random Forest) are available as well.

Some hyperparameters such as `penalty` can be single values or ranges. If they are expressed as ranges, auto-tuning is used to search for their best value.

In [None]:
# Add a model candidate to train
node_pipeline.addLogisticRegression(maxEpochs=1000, penalty=(0.00038, 0.00042))
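
The idea of searching a range can be sketched as a simple random search: draw a few candidate values from the range, score each one, and keep the best. The sketch below is only an illustration under stated assumptions: the uniform sampling and the toy `score` function are made up for this example and do not reflect the actual search strategy or metric computation in GDS.

```python
import random


def random_search(low, high, score, trials=10, seed=42):
    """Return the best-scoring candidate among values sampled uniformly from [low, high]."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(trials)]
    return max(candidates, key=score)


def score(penalty):
    # Toy objective (an assumption): pretend the best penalty is 0.0004
    return -abs(penalty - 0.0004)


best_penalty = random_search(0.00038, 0.00042, score)
print(f"Best penalty found: {best_penalty:.6f}")
```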

## Run the training

It is now possible to train the configured models. We also run a training estimate, to make sure there are enough resources to run the actual training afterwards.

The Node Classification model supports several evaluation metrics. Here we use the global metric `F1_WEIGHTED`.

**NOTE:** The `concurrency` parameter is explicitly set to 4 (the default value) for demonstration purposes. 
The maximum concurrency in the library is limited to 4 for Neo4j Community Edition.

In [None]:
# Estimate the resources needed for training the model
node_pipeline.train_estimate(
    G,
    targetNodeLabels=["Paper"],
    modelName="cora-pipeline-model",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4,
)

In [107]:
# Perform the actual training
model, stats = node_pipeline.train(
    G,
    targetNodeLabels=["Paper"],
    modelName="cora-pipeline-model",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4,
)

Node Classification Train Pipeline: 100%|██████████| 100.0/100 [08:56<00:00,  5.36s/%]


We can inspect the result of the training, for example to print the evaluation metrics of the trained model.

In [108]:
# Uncomment to print all stats
# print(stats.to_json(indent=2))

# Print F1_WEIGHTED metric
stats["modelInfo"]["metrics"]["F1_WEIGHTED"]["test"]

0.7367307528092472
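
For reference, a weighted F1 score is the average of the per-class F1 scores, each weighted by the number of true instances of that class. The following minimal sketch computes it in plain Python on made-up labels; it illustrates the standard definition of the metric, not the GDS implementation.

```python
from collections import Counter


def f1_weighted(actual, predicted):
    """Average of per-class F1 scores, weighted by class support."""
    support = Counter(actual)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
        fp = sum(1 for a, p in zip(actual, predicted) if a != cls and p == cls)
        fn = count - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least `count`
        f1 = 2 * tp / (2 * tp + fp + fn)
        total += f1 * count
    return total / len(actual)


# Made-up labels using the class IDs from this notebook
actual = [100, 100, 1, 2, 2, 2]
predicted = [100, 1, 1, 2, 2, 100]
print(round(f1_weighted(actual, predicted), 4))  # 0.6778
```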

## Use the model for prediction

After the model has been trained, it is possible to use it to classify unclassified data. 

One simple way to use the `predict` mode is to just stream the result of the prediction. This can be impractical when a graph is very large, so it should be used only for experimentation purposes.

In this example we use the `targetNodeLabels=["UnclassifiedPaper"]` filter to run the prediction only on the unclassified nodes. Note that, when a pipeline has `nodePropertySteps` that use relationships (such as FastRP and other algorithms that create embeddings, but also algorithms like PageRank), the `targetNodeLabels` filter should include both the classified and the unclassified nodes _unless_ the classified node labels are added with the `contextNodeLabels` parameter. We will see an example of this in the following section.

In [109]:
predicted = model.predict_stream(
    G, modelName="cora-pipeline-model", includePredictedProbabilities=True, targetNodeLabels=["UnclassifiedPaper"]
)

The result of the prediction is a DataFrame containing the predicted class and the predicted probabilities for all the classes for each node.

In [110]:
predicted

| | nodeId | predictedClass | predictedProbabilities |
|---|---|---|---|
| 0 | 0 | 100 | [0.013618774341613382, 0.0008177054013954473, ... |
| 1 | 1 | 1 | [0.3186482744624026, 0.02609876129717121, 0.12... |
| 2 | 2 | 2 | [0.007710767899246522, 0.7663149051403245, 0.0... |
| 3 | 3 | 2 | [0.00350003910223565, 0.9768116873834239, 0.00... |
| 4 | 4 | 3 | [0.03084788108403201, 0.0013559744680570265, 0... |
| 5 | 5 | 1 | [0.2750609369819228, 0.26605086734276934, 0.09... |
| 6 | 6 | 6 | [0.01777377696320589, 0.06070786119694413, 0.0... |
| 7 | 7 | 100 | [0.002836007543472271, 0.004250369599202609, 0... |
| 8 | 8 | 100 | [0.018410346867102398, 0.02619614116987539, 0.... |
| 9 | 9 | 4 | [0.011515584874249689, 0.33174804059304086, 0.... |


The order of the classes in the `predictedProbabilities` field is given in the model information, and can be used to retrieve the predicted probability for the predicted class.

In [111]:
classes = stats["modelInfo"]["classes"]
print(classes)

[1, 2, 3, 4, 5, 6, 100]


In [112]:
# Calculate the confidence percentage for the predicted class
predicted["confidence"] = predicted.apply(
    lambda row: np.floor(row["predictedProbabilities"][classes.index(row["predictedClass"])] * 100), axis=1
)

In [113]:
predicted

| | nodeId | predictedClass | predictedProbabilities | confidence |
|---|---|---|---|---|
| 0 | 0 | 100 | [0.013618774341613382, 0.0008177054013954473, ... | 73.0 |
| 1 | 1 | 1 | [0.3186482744624026, 0.02609876129717121, 0.12... | 31.0 |
| 2 | 2 | 2 | [0.007710767899246522, 0.7663149051403245, 0.0... | 76.0 |
| 3 | 3 | 2 | [0.00350003910223565, 0.9768116873834239, 0.00... | 97.0 |
| 4 | 4 | 3 | [0.03084788108403201, 0.0013559744680570265, 0... | 86.0 |
| 5 | 5 | 1 | [0.2750609369819228, 0.26605086734276934, 0.09... | 27.0 |
| 6 | 6 | 6 | [0.01777377696320589, 0.06070786119694413, 0.0... | 82.0 |
| 7 | 7 | 100 | [0.002836007543472271, 0.004250369599202609, 0... | 89.0 |
| 8 | 8 | 100 | [0.018410346867102398, 0.02619614116987539, 0.... | 54.0 |
| 9 | 9 | 4 | [0.011515584874249689, 0.33174804059304086, 0.... | 46.0 |
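
The lookup used above can also be checked on a single row without pandas. In this sketch the probabilities are made-up values (`row` is a hypothetical example, not taken from the output above); it shows that the position of a class in `classes` selects its probability, and that the predicted class is expected to be the one with the highest probability.

```python
# Class order as reported by the model information
classes = [1, 2, 3, 4, 5, 6, 100]

# Hypothetical prediction row with made-up probabilities
row = {
    "predictedClass": 100,
    "predictedProbabilities": [0.05, 0.02, 0.03, 0.10, 0.04, 0.03, 0.73],
}

# Probability of the predicted class, found through its position in `classes`
probability = row["predictedProbabilities"][classes.index(row["predictedClass"])]
print(probability)  # 0.73

# The class with the highest probability should match the predicted class
best_index = max(range(len(classes)), key=lambda i: row["predictedProbabilities"][i])
print(classes[best_index])  # 100
```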


## Adding a data preprocessing step

The performance of the model can potentially be increased by adding more features or by using different features altogether. One way is to use algorithms such as FastRP that create embeddings based on both node properties and graph features, which can be added via the `addNodeProperty` pipeline method.

In this example we also use the `contextNodeLabels` parameter to restrict the type of nodes to calculate the embeddings on. This is useful in the prediction phase as it makes it unnecessary to explicitly include the labelled nodes.

More embedding methods are available in GDS, as well as other pre-processing algorithms.

In [None]:
node_pipeline_fastrp, _ = gds.beta.pipeline.nodeClassification.create("cora-pipeline-fastrp")

# Add a step in the pipeline that mutates the graph
node_pipeline_fastrp.addNodeProperty(
    "fastRP",
    mutateProperty="embedding",
    embeddingDimension=512,
    propertyRatio=1.0,
    randomSeed=42,
    featureProperties=["features"],
    contextNodeLabels=["Paper"]
)

With the node embeddings available as features, we no longer use the original raw `features`.

In [None]:
node_pipeline_fastrp.selectFeatures(["embedding"])

In [None]:
# Configure the pipeline as before
node_pipeline_fastrp.configureSplit(testFraction=0.2, validationFolds=5)

node_pipeline_fastrp.addLogisticRegression(maxEpochs=1000, penalty=(0.00048, 0.00050))

In [None]:
# Perform the actual training
model_fastrp, stats_fastrp = node_pipeline_fastrp.train(
    G,
    targetNodeLabels=["Paper"],
    modelName="cora-pipeline-model-fastrp",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4,
)

In [142]:
print(stats_fastrp["modelInfo"]["metrics"]["F1_WEIGHTED"]["test"])

0.8466258544158551


## Use the new model for prediction

When a pipeline has `nodePropertySteps` that use relationships (such as FastRP and other algorithms that create embeddings, but also algorithms like PageRank), the `targetNodeLabels` must include both classified and unclassified nodes because otherwise the resulting embeddings would be "skewed".

Here we are using the `targetNodeLabels` parameter **only** with `UnclassifiedPaper` because we added the `Paper` label as a context node label.
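
Why the context matters can be seen with a toy node property step. The sketch below is purely illustrative and is not FastRP: it computes, for a node, the mean feature value of its neighbours, considering only neighbours whose label takes part in the computation. All nodes, labels and values are made up.

```python
# Made-up graph: node -> (label, feature value)
nodes = {
    "a": ("UnclassifiedPaper", 1.0),
    "b": ("Paper", 3.0),
    "c": ("Paper", 5.0),
    "d": ("UnclassifiedPaper", 9.0),
}
# Undirected citations, stored as adjacency lists
neighbours = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a"]}


def neighbour_mean(node, included_labels):
    """Mean feature of the neighbours whose label takes part in the computation."""
    values = [nodes[n][1] for n in neighbours[node] if nodes[n][0] in included_labels]
    return sum(values) / len(values) if values else 0.0


# Only the target label: the classified neighbours are ignored
target_only = neighbour_mean("a", {"UnclassifiedPaper"})
print(target_only)  # 9.0

# Target label plus context label: all neighbours contribute, (3 + 5 + 9) / 3
with_context = neighbour_mean("a", {"UnclassifiedPaper", "Paper"})
print(with_context)
```

The two results differ, which is the kind of discrepancy between training-time and prediction-time features that adding the `Paper` label as context avoids.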

In [143]:
predicted_fastrp = model_fastrp.predict_stream(
    G, modelName="cora-pipeline-model-fastrp", includePredictedProbabilities=True,
    targetNodeLabels=["UnclassifiedPaper"]
)

Node Classification Predict Pipeline: 100%|██████████| 100.0/100 [00:00<00:00, 227.43%/s]


In [144]:
predicted_fastrp.count()

nodeId                    10
predictedClass            10
predictedProbabilities    10
dtype: int64

Because of the `targetNodeLabels` filter, the `predicted_fastrp` result contains only the 10 holdout nodes.

Instead of streaming the results, the prediction can be run in `mutate` mode to be more performant. In this mode the predictions are added to the projected graph as a new node property, which can then be retrieved using the `streamNodeProperty` method filtered by the `UnclassifiedPaper` label.

In [None]:
model_fastrp.predict_mutate(
    G,
    mutateProperty="predictedClass",
    modelName="cora-pipeline-model-fastrp",
    predictedProbabilityProperty="predictedProbabilities",
    targetNodeLabels=["UnclassifiedPaper"]
)

In [153]:
predicted_fastrp = gds.graph.streamNodeProperty(G, "predictedClass", ["UnclassifiedPaper"])

predicted_fastrp

| | nodeId | propertyValue |
|---|---|---|
| 0 | 0 | 100 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 2 |
| 4 | 4 | 3 |
| 5 | 5 | 3 |
| 6 | 6 | 4 |
| 7 | 7 | 100 |
| 8 | 8 | 100 |
| 9 | 9 | 4 |


In [154]:
# Retrieve node information from Neo4j using the node IDs from the prediction result
nodes = gds.util.asNodes(predicted_fastrp.nodeId.to_list())

# Create a new DataFrame containing node IDs along with node properties
nodes_df = pd.DataFrame([(node.id, node["subject"]) for node in nodes], columns=["nodeId", "subject"])

# Merge with the prediction result on node IDs, to check the predicted value
# against the original subject
#
# NOTE: This could also be replaced by just appending `node["subject"]` as a
# Series since the node order would not change, but a proper merge (or join)
# is clearer and less prone to errors.
predicted_fastrp.merge(nodes_df, on="nodeId")

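| | nodeId | propertyValue | subject |
|---|---|---|---|
| 0 | 0 | 100 | 100 |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |
| 3 | 3 | 2 | 2 |
| 4 | 4 | 3 | 3 |
| 5 | 5 | 3 | 3 |
| 6 | 6 | 4 | 4 |
| 7 | 7 | 100 | 100 |
| 8 | 8 | 100 | 100 |
| 9 | 9 | 4 | 4 |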


As we can see, the prediction for all the holdout nodes is accurate.
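
The same comparison can be summarised as a single number. This minimal sketch computes the accuracy over the ten holdout nodes, with the values copied from the merged result into plain lists (the variable names are introduced here for illustration).

```python
# Values copied from the merged result
predicted_classes = [100, 1, 2, 2, 3, 3, 4, 100, 100, 4]
actual_subjects = [100, 1, 2, 2, 3, 3, 4, 100, 100, 4]

matches = sum(p == a for p, a in zip(predicted_classes, actual_subjects))
accuracy = matches / len(actual_subjects)
print(f"{matches}/{len(actual_subjects)} correct, accuracy = {accuracy:.2f}")
# 10/10 correct, accuracy = 1.00
```

With only ten nodes this is a sanity check rather than a reliable estimate of model quality.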

## Write result back to Neo4j

Finally, the prediction can be run in `write` mode to store the predicted class and the predicted probabilities as node properties in the Neo4j database.

In [None]:
model_fastrp.predict_write(
    G,
    writeProperty="predictedSubject",
    modelName="cora-pipeline-model-fastrp",
    predictedProbabilityProperty="predictedProbabilities",
)

## Cleanup

When the graph, the model and the pipeline are no longer needed, they should be dropped to free up memory:

In [None]:
model.drop()
model_fastrp.drop()
node_pipeline.drop()
node_pipeline_fastrp.drop()

G.drop()
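# WARNING: this deletes every node and relationship in the database,
# not only the Cora data loaded by this notebook
gds.run_cypher("MATCH (n) DETACH DELETE n")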

It is good practice to close the client as well:

In [157]:
gds.close()