
# Machine learning pipelines: node classification example

This notebook shows the usage of GDS machine learning pipelines with the Python client and the well-known Cora dataset.

## Setup

We need an environment where Neo4j and GDS are available, for example AuraDS (which comes with GDS preinstalled) or Neo4j Desktop. 

Once the credentials to this environment are available, we can install the `graphdatascience` package and create the `gds` object.

In [None]:
!pip install graphdatascience==1.3.0

In [None]:
# Import the client
from graphdatascience import GraphDataScience

# Replace with the actual credentials
AURA_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = ""

# Configure the client with AuraDS-recommended settings if using AuraDS
gds = GraphDataScience(
    AURA_CONNECTION_URI,
    auth=(AURA_USERNAME, AURA_PASSWORD),
    aura_ds=True
)

We import `json` to help in writing the Cypher queries used to load the data, and `numpy` and `pandas` for further data processing.

In [None]:
import json

import numpy as np
import pandas as pd

## Load the Cora dataset

First, we need to load the Cora dataset into Neo4j. While loading, we also convert the `subject` field (a string in the dataset) into an integer, because node properties must be numerical in order to be projected into a graph. Although we could assign consecutive IDs, we deliberately assign an ID other than 0 to the first class to show how the classes are represented in the model.

Finally, we select a number of nodes to be held out to test the model after it has been trained. This holdout set is separate from the train/test split that the pipeline performs during training.

In [None]:
# TODO: use URLs within the client repo when the notebook is added there
CORA_CONTENT = "https://raw.githubusercontent.com/neo4j/graph-data-science/master/test-utils/src/main/resources/cora.content"
CORA_CITES = "https://raw.githubusercontent.com/neo4j/graph-data-science/master/test-utils/src/main/resources/cora.cites"

In [None]:
# Using non-consecutive values to show class ordering later on

SUBJECT_TO_ID = {
    "Neural_Networks": 100,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6
}

HOLDOUT_NODES = 10

In [None]:
# Define a string representation of the SUBJECT_TO_ID map using backticks
subject_map = json.dumps(SUBJECT_TO_ID).replace('"', '`')

print(subject_map)
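
To make the transformation concrete, here is a small standalone sketch of what the replacement produces. It uses a two-entry stand-in map (`sample_map`, a name introduced only for this illustration) rather than the full `SUBJECT_TO_ID`. The double quotes emitted by `json.dumps` are swapped for backticks so that the keys read as backtick-quoted identifiers inside the Cypher map.

```python
import json

# A two-entry stand-in for SUBJECT_TO_ID, to keep the output short
sample_map = {"Neural_Networks": 100, "Rule_Learning": 1}

# JSON representation, with double-quoted keys
as_json = json.dumps(sample_map)

# Same text with every double quote replaced by a backtick
as_cypher = as_json.replace('"', '`')

print(as_json)
print(as_cypher)
```

This shortcut relies on the keys and values containing no double quotes of their own, which holds for the subject names used here.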

In [None]:
# Cypher command to load the nodes using `LOAD CSV`, taking care of
# converting the string `subject` field into an integer and
# replacing the node label for the holdout nodes
load_nodes = f"""
    LOAD CSV FROM "{CORA_CONTENT}" AS row
    WITH 
      {subject_map} AS subject_to_id,
      toInteger(row[0]) AS extId, 
      row[1] AS subject, 
      toIntegerList(row[2..]) AS features
    MERGE (p:Paper {{extId: extId, subject: subject_to_id[subject], features: features}})
    WITH p LIMIT {HOLDOUT_NODES}
    REMOVE p:Paper
    SET p:UnclassifiedPaper
"""

# Cypher command to load the relationships using `LOAD CSV`
load_relationships = f"""
    LOAD CSV FROM "{CORA_CITES}" AS row
    MATCH (n), (m) 
    WHERE n.extId = toInteger(row[0]) AND m.extId = toInteger(row[1])
    MERGE (n)-[:CITES]->(m)
"""

# Load nodes and relationships on Neo4j
gds.run_cypher(load_nodes)
gds.run_cypher(load_relationships)

With the data loaded into Neo4j, we can now project a graph including all the nodes and the `CITES` relationship as undirected. We use the `SINGLE` aggregation to collapse the repeated relationships that can result from adding the inverse direction.

In [None]:
# Create the projected graph containing both classified and unclassified nodes
G, _ = gds.graph.project(
    "cora-graph",
    {
        "Paper": {
            "properties": ["features", "subject"]
        },
        "UnclassifiedPaper": {
            "properties": ["features"]
        }
    },
    {
        "CITES": {
            "orientation": "UNDIRECTED",
            "aggregation": "SINGLE"
        }
    }
)
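
The effect of combining an undirected orientation with `SINGLE` aggregation can be pictured with plain Python tuples. This is only a conceptual sketch on a made-up edge list, not how GDS stores relationships internally: when two papers cite each other, adding the inverse direction yields the same pair twice, and deduplication keeps a single copy.

```python
# Directed citations: papers 1 and 2 cite each other, paper 1 also cites 3
directed = [(1, 2), (2, 1), (1, 3)]

# An undirected view also contains each relationship in reverse
both_directions = directed + [(target, source) for source, target in directed]

# Without aggregation, the pair (1, 2) now appears twice
parallel = both_directions.count((1, 2))

# SINGLE-style aggregation keeps one relationship per (source, target) pair
single = sorted(set(both_directions))

print(parallel)
print(single)
```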

## Pipeline catalog basics

Once the dataset has been loaded, we can define a node classification machine learning pipeline.

In [None]:
# Create the pipeline
node_pipeline, _ = gds.beta.pipeline.nodeClassification.create("cora-pipeline")

In [None]:
# List all pipelines
gds.beta.pipeline.list()

In [None]:
# List a specific pipeline object
gds.beta.pipeline.list(node_pipeline)

## Configure the pipeline

We can now configure the pipeline. As a reminder, we need to:

1. Select a subset of the available node properties to be used as features for the machine learning model
1. Configure the train/test split and the number of folds for k-fold cross-validation _(optional)_
1. Configure the candidate models for training

In [None]:
# "Mark" some node properties that will be used as features
node_pipeline.selectFeatures(
    ["features"]
)

In [None]:
# If needed, change the train/test split ratio and the number of folds
# for k-fold cross-validation
node_pipeline.configureSplit(
    testFraction=0.2,
    validationFolds=5
)
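
As a rough guide to what these two settings imply, the sketch below computes approximate set sizes for a hypothetical graph of 1000 labelled nodes. The function `split_sizes` is introduced here for illustration only. The exact rounding and any stratification applied by GDS may differ.

```python
def split_sizes(node_count, test_fraction, validation_folds):
    """Return approximate (test, train, per-fold validation) set sizes."""
    test_size = int(node_count * test_fraction)
    train_size = node_count - test_size
    # Each fold holds out roughly one k-th of the training set for validation
    fold_size = train_size // validation_folds
    return test_size, train_size, fold_size

sizes = split_sizes(1000, 0.2, 5)
print(sizes)
```

With `testFraction=0.2` and `validationFolds=5`, about a fifth of the nodes are reserved for testing, and the remaining nodes are rotated through five train/validation partitions.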

Here we use Logistic Regression as an example for the training, but other algorithms (such as Random Forest) are available as well.

Some hyperparameters, such as `penalty`, can be given either as single values or as ranges. If they are expressed as ranges, auto-tuning is used to search for their best value.

In [None]:
# Add a model candidate to train
# Note: penalty can be a single value like 0.0004 or a range
node_pipeline.addLogisticRegression(
    maxEpochs=1000,
    penalty=(0.00038, 0.00042)
)
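
The idea of tuning a hyperparameter over a range can be illustrated with a plain random search. This sketch is an assumption-laden stand-in: it does not reproduce the search strategy used by GDS, and `toy_score` is a hypothetical scoring function that replaces a real cross-validation score.

```python
import random

def random_search(low, high, score, trials=10, seed=42):
    """Sample candidates uniformly from [low, high] and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(trials)]
    return max(candidates, key=score)

# Hypothetical validation score that peaks at penalty = 0.0004
def toy_score(penalty):
    return -abs(penalty - 0.0004)

best_penalty = random_search(0.00038, 0.00042, toy_score)
print(best_penalty)
```

A fixed seed makes the search reproducible, which is also why `randomSeed` is passed to the training calls in this notebook.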

## Run the training

It is now possible to train the configured models. Before doing so, we run a training estimate to make sure there are enough resources to run the actual training.

The Node Classification model supports several evaluation metrics. Here we use the global metric `F1_WEIGHTED`.
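
For reference, a weighted F1 score is commonly defined as the average of the per-class F1 scores, each weighted by the share of examples belonging to that class. The sketch below implements that common definition in plain Python on made-up labels. The names `f1_weighted`, `y_true` and `y_pred` are introduced for illustration and are not part of the client.

```python
from collections import Counter

def f1_weighted(y_true, y_pred):
    """Per-class F1 scores averaged with weights proportional to class support."""
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        total += f1 * count / len(y_true)
    return total

# Made-up labels using the same kind of class IDs as in this notebook
y_true = [1, 1, 1, 2, 2, 100]
y_pred = [1, 1, 2, 2, 2, 100]

score = f1_weighted(y_true, y_pred)
print(round(score, 3))
```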

**NOTE:** The `concurrency` parameter is explicitly set to 4 (the default value) for demonstration purposes. 
The maximum concurrency in the library is limited to 4 for Neo4j Community Edition.

In [None]:
# Estimate the resources needed for training the model
node_pipeline.train_estimate(
    G,
    nodeLabels=["Paper"],
    modelName="cora-pipeline-model",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4
)

In [None]:
# Perform the actual training
model, stats = node_pipeline.train(
    G,
    nodeLabels=["Paper"],
    modelName="cora-pipeline-model",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4
)

We can inspect the result of the training, for example to print the evaluation metrics of the trained model.

In [None]:
# Uncomment to print all stats
# print(stats.to_json(indent=2))

# Print F1_WEIGHTED metric
print(stats["modelInfo"]["metrics"]["F1_WEIGHTED"]["test"])

## Use the model for prediction

After the model has been trained, it is possible to use it to classify unclassified data. 

One simple way to use the `predict` mode is to just stream the result of the prediction. This can be impractical when a graph is very large, so it should be used only for experimentation purposes.

In this example we use the `nodeLabels=["UnclassifiedPaper"]` filter to run the prediction on the unclassified nodes only. Note that when a model comes from a pipeline with `nodePropertySteps` that use relationships (such as FastRP and other algorithms that create embeddings, but also algorithms like PageRank), the `nodeLabels` filter should not be used, because it would "cut out" all the linked nodes that have a different label. We will see an example of this in the following section.

In [None]:
predicted = model.predict_stream(
    G,
    modelName="cora-pipeline-model",
    includePredictedProbabilities=True,
    nodeLabels=["UnclassifiedPaper"]
)

The result of the prediction is a DataFrame containing the predicted class and the predicted probabilities for all the classes for each node.

In [None]:
predicted

The order of the classes in the `predictedProbabilities` field is given in the model information, and can be used to retrieve the predicted probability for the predicted class.

In [None]:
classes = stats["modelInfo"]["classes"]
print(classes)
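
The lookup can be tried out on a single hand-written row before applying it to the whole DataFrame. In this sketch both the class order in `example_classes` and the probabilities are made-up values chosen for illustration. The real order is whatever the model information reports.

```python
import math

# Assumed class order, standing in for the list reported by the model
example_classes = [1, 2, 3, 4, 5, 6, 100]

# Hypothetical prediction for a single node
example_predicted_class = 100
example_probabilities = [0.0, 0.0, 0.125, 0.0, 0.0, 0.0, 0.875]

# The position of the class in the list is also its position in the probabilities
position = example_classes.index(example_predicted_class)
confidence = math.floor(example_probabilities[position] * 100)

print(position, confidence)
```

Looking up the position by class value, rather than using the class value itself as an index, is what makes the non-consecutive ID `100` work.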

In [None]:
# Calculate the confidence percentage for the predicted class
predicted["confidence"] = predicted.apply(
    lambda row: np.floor(row["predictedProbabilities"][classes.index(row["predictedClass"])] * 100), 
    axis=1
)

In [None]:
predicted

## Adding a data preprocessing step

The performance of the model can potentially be increased by adding more features or by using different features altogether. One way is to use algorithms that create embeddings based on both node properties and graph features. One such algorithm is FastRP, which can be added via the `addNodeProperty` pipeline method.

More embedding methods are available in GDS, as well as other pre-processing algorithms.

In [None]:
node_pipeline_fastrp, _ = gds.beta.pipeline.nodeClassification.create("cora-pipeline-fastrp")

# Add a step in the pipeline that mutates the graph
node_pipeline_fastrp.addNodeProperty(
    "fastRP",
    mutateProperty="embedding",
    embeddingDimension=512,
    propertyRatio=1.0,
    randomSeed=42,
    featureProperties=["features"]
)

With the node embeddings available as features, we no longer use the original raw `features`.

In [None]:
node_pipeline_fastrp.selectFeatures(
    ["embedding"]
)

In [None]:
# Configure the pipeline as before
node_pipeline_fastrp.configureSplit(
    testFraction=0.2,
    validationFolds=5
)

node_pipeline_fastrp.addLogisticRegression(
    maxEpochs=1000,
    penalty=(0.00048, 0.00050)
)

In [None]:
# Perform the actual training
model_fastrp, stats_fastrp = node_pipeline_fastrp.train(
    G,
    nodeLabels=["Paper"],
    modelName="cora-pipeline-model-fastrp",
    targetProperty="subject",
    metrics=["F1_WEIGHTED"],
    randomSeed=42,
    concurrency=4
)

In [None]:
print(stats_fastrp["modelInfo"]["metrics"]["F1_WEIGHTED"]["test"])

## Use the model for prediction

Here we are **not** using the `nodeLabels=["UnclassifiedPaper"]` parameter, because FastRP depends on neighbour nodes. As noted earlier, when a model comes from a pipeline with `nodePropertySteps` that use relationships (such as FastRP and other algorithms that create embeddings, but also algorithms like PageRank), the `nodeLabels` filter should not be used, because it would "cut out" all the linked nodes that have a different label.
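
The "cut out" effect can be seen on a toy graph in plain Python. This is a conceptual sketch with made-up nodes and links, independent of GDS: once the nodes are filtered by label, every link to a node with a different label disappears, so any neighbour-based computation sees a much poorer neighbourhood.

```python
# Toy graph: node -> label, plus undirected citation links
labels = {1: "Paper", 2: "Paper", 3: "UnclassifiedPaper", 4: "UnclassifiedPaper"}
links = [(1, 3), (2, 3), (3, 4)]

def neighbour_counts(allowed_labels):
    """Count each kept node's neighbours after filtering the nodes by label."""
    kept = {node for node, label in labels.items() if label in allowed_labels}
    counts = {node: 0 for node in sorted(kept)}
    for a, b in links:
        # A link survives only if both of its endpoints are kept
        if a in kept and b in kept:
            counts[a] += 1
            counts[b] += 1
    return counts

unfiltered = neighbour_counts({"Paper", "UnclassifiedPaper"})
filtered = neighbour_counts({"UnclassifiedPaper"})

print(unfiltered)
print(filtered)
```

Node 3 goes from three neighbours to one after filtering, which is the kind of information loss the unfiltered prediction avoids.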

In [None]:
predicted_fastrp = model_fastrp.predict_stream(
    G,
    modelName="cora-pipeline-model-fastrp",
    includePredictedProbabilities=True
)

In [None]:
predicted_fastrp.count()

Since we have used no filters, the `predicted_fastrp` result contains _all_ the nodes. One way to filter the nodes is the `streamNodeProperty` method, which can be used only after the new property has been added to the projected graph via the `mutate` mode.

Instead of streaming the results, the prediction can be run in `mutate` mode, which is more performant. The predicted nodes can then be retrieved using the `streamNodeProperty` method with the `UnclassifiedPaper` label.

In [None]:
model_fastrp.predict_mutate(
    G,
    mutateProperty="predictedClass",
    modelName="cora-pipeline-model-fastrp",
    predictedProbabilityProperty="predictedProbabilities"
)

In [None]:
predicted_fastrp = gds.graph.streamNodeProperty(
    G,
    "predictedClass",
    ["UnclassifiedPaper"]
)

predicted_fastrp

In [None]:
# Retrieve node information from Neo4j using the node IDs from the prediction result
nodes = gds.util.asNodes(predicted_fastrp.nodeId.to_list())

# Create a new DataFrame containing node IDs along with node properties
nodes_df = pd.DataFrame([(node.id, node["subject"]) for node in nodes], columns=["nodeId", "subject"])

# Merge with the prediction result on node IDs, to check the predicted value
# against the original subject
#
# NOTE: This could also be replaced by just appending `node["subject"]` as a 
# Series since the node order would not change, but a proper merge (or join) 
# is clearer and less prone to errors.
predicted_fastrp.merge(nodes_df, on="nodeId")

As we can see, the prediction for all the holdout nodes is accurate.
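
Rather than comparing the rows by eye, the same check can be expressed as a computed accuracy. The sketch below uses made-up node IDs and classes in plain Python structures. With the merged DataFrame, the same comparison would be made between the predicted and original columns.

```python
# Hypothetical (nodeId, predictedClass) rows and ground-truth subjects
example_predictions = [(10, 100), (11, 3), (12, 5)]
example_subjects = {10: 100, 11: 3, 12: 6}

# Join on the node ID and compare the predicted class with the original subject
matches = [
    example_subjects[node_id] == predicted_class
    for node_id, predicted_class in example_predictions
]
accuracy = sum(matches) / len(matches)

print(matches)
print(round(accuracy, 2))
```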

## Write result back to Neo4j

In [None]:
# Write the predicted class and probabilities back to the database
# as node properties
model_fastrp.predict_write(
    G,
    writeProperty="predictedSubject",
    modelName="cora-pipeline-model-fastrp",
    predictedProbabilityProperty="predictedProbabilities",
)

## Cleanup

In [None]:
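# Drop the trained models and the pipelines
model.drop()
model_fastrp.drop()
node_pipeline.drop()
node_pipeline_fastrp.drop()

# Drop the projected graph
G.drop()

# WARNING: this deletes all the nodes and relationships in the database
gds.run_cypher("MATCH (n) DETACH DELETE n")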