<a href="https://colab.research.google.com/github/neo4j/graph-data-science-client/blob/main/examples/load-data-via-graph-construction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load data to a projected graph via graph construction

This notebook shows how to use the `gds.alpha.graph.construct` method (available in GDS 2.1 and later) to build a graph directly in memory from Pandas `DataFrame`s.

**NOTE:** If you are using AuraDS, it is currently not possible to write the projected graph back to Neo4j.

## Setup

We need an environment where Neo4j and GDS are available, for example AuraDS (which comes with GDS preinstalled) or Neo4j Desktop. 

Once the credentials to this environment are available, we can install the `graphdatascience` package and create the `gds` object.

In [1]:
!pip install graphdatascience==1.4


In [2]:
# Import the client
from graphdatascience import GraphDataScience

# Replace with the actual credentials
AURA_CONNECTION_URI = "neo4j+s://xxxxxxxx.databases.neo4j.io"
AURA_USERNAME = "neo4j"
AURA_PASSWORD = ""

# Configure the client with AuraDS-recommended settings if using AuraDS
gds = GraphDataScience(AURA_CONNECTION_URI, auth=(AURA_USERNAME, AURA_PASSWORD), aura_ds=True)

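
If you prefer not to hard-code credentials in the notebook, one option is to read them from environment variables. This is only a sketch: the variable names `NEO4J_URI`, `NEO4J_USERNAME` and `NEO4J_PASSWORD` are assumptions chosen for illustration, and the fallbacks simply mirror the placeholders used above.

```python
import os

# Hypothetical environment variable names; the defaults mirror the placeholders above.
uri = os.environ.get("NEO4J_URI", "neo4j+s://xxxxxxxx.databases.neo4j.io")
username = os.environ.get("NEO4J_USERNAME", "neo4j")
password = os.environ.get("NEO4J_PASSWORD", "")

# This tuple has the same (username, password) shape passed as `auth` above.
auth = (username, password)
print(uri, username)
```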

We also import `pandas` to create a Pandas `DataFrame` from the original data source.

In [None]:
import pandas as pd

## Load the Cora dataset

In [None]:
# TODO: use URLs within the client repo when the notebook is added there
CORA_CONTENT = (
    "https://raw.githubusercontent.com/neo4j/graph-data-science/master/test-utils/src/main/resources/cora.content"
)
CORA_CITES = (
    "https://raw.githubusercontent.com/neo4j/graph-data-science/master/test-utils/src/main/resources/cora.cites"
)

We can load each CSV locally as a Pandas `DataFrame`.

In [None]:
content = pd.read_csv(CORA_CONTENT, header=None)
cites = pd.read_csv(CORA_CITES, header=None)
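
Because the files are read with `header=None`, pandas labels the columns with integers starting at 0, which is why the following cells refer to `content[0]` or `cites[1]`. A minimal sketch on inline data (the two rows are made up for illustration and are not taken from Cora):

```python
import io

import pandas as pd

# Two made-up rows shaped like a headerless CSV: id, subject, then feature flags.
toy_csv = io.StringIO("1,Theory,0,1\n2,Rule_Learning,1,0\n")
toy = pd.read_csv(toy_csv, header=None)

# Columns are labelled 0, 1, 2, ... so they are selected by integer, not by name.
print(list(toy.columns))
print(toy[0].tolist())
```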

We need to perform an additional preprocessing step to convert the `subject` field (which is a string in the dataset) into an integer, because node properties have to be numerical in order to be projected into a graph. We can use a map for this.

In [None]:
SUBJECT_TO_ID = {
    "Neural_Networks": 0,
    "Rule_Learning": 1,
    "Reinforcement_Learning": 2,
    "Probabilistic_Methods": 3,
    "Theory": 4,
    "Genetic_Algorithms": 5,
    "Case_Based": 6,
}
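
Before applying such a mapping, it can be worth checking that every label in the data is covered by it. The sketch below uses `Series.map` rather than `replace` purely for the check, since `map` turns unknown labels into `NaN`; the sample values, including `Unknown_Topic`, are made up.

```python
import pandas as pd

# Subset of the mapping, repeated here so the sketch is self-contained.
subject_to_id = {"Neural_Networks": 0, "Rule_Learning": 1, "Theory": 4}

# Made-up subject column; "Unknown_Topic" is deliberately missing from the mapping.
subjects = pd.Series(["Theory", "Rule_Learning", "Unknown_Topic"])

# map() yields NaN for unmapped labels, whereas replace() would leave the string in place.
mapped = subjects.map(subject_to_id)
unmapped = subjects[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list means every subject will be converted to a number.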

We can now create a new `DataFrame` with a `nodeId` field, a list of node labels,
and the additional node properties `subject` (using the `SUBJECT_TO_ID`
mapping) and `features` (converting all the feature columns to a single
array column).

In [None]:
nodes = pd.DataFrame().assign(
    nodeId=content[0],
    labels="Paper",
    subject=content[1].replace(SUBJECT_TO_ID),
    features=content.iloc[:, 2:].apply(list, axis=1),
)
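
To see what the `features` expression does in isolation, here is a small sketch on a made-up frame with three feature columns. Applying `list` row by row (`axis=1`) collapses the selected columns into one list per row.

```python
import pandas as pd

# Made-up rows: id, subject, then three binary feature columns.
toy = pd.DataFrame([[1, "Theory", 0, 1, 1], [2, "Theory", 1, 0, 0]])

# Select every column from position 2 onwards and turn each row into a list.
features = toy.iloc[:, 2:].apply(list, axis=1)
print(features.tolist())
```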

Let's check the first 5 rows of the new `DataFrame`:

In [None]:
nodes.head()

Now we create a new `DataFrame` containing the relationships between the nodes.
To create the equivalent of an undirected graph, we need to add direct
and inverse relationships explicitly.

In [None]:
dir_relationships = pd.DataFrame().assign(sourceNodeId=cites[0], targetNodeId=cites[1], relationshipType="CITES")
inv_relationships = pd.DataFrame().assign(sourceNodeId=cites[1], targetNodeId=cites[0], relationshipType="CITES")

relationships = pd.concat([dir_relationships, inv_relationships]).drop_duplicates()
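
The `drop_duplicates()` call matters when a pair of papers already appears in both directions in the source data. A sketch on made-up citation pairs shows the effect:

```python
import pandas as pd

# Made-up citation pairs; the pair (1, 2) is present in both directions.
toy_cites = pd.DataFrame({0: [1, 2, 1], 1: [2, 1, 3]})

forward = pd.DataFrame(
    {"sourceNodeId": toy_cites[0], "targetNodeId": toy_cites[1], "relationshipType": "CITES"}
)
backward = pd.DataFrame(
    {"sourceNodeId": toy_cites[1], "targetNodeId": toy_cites[0], "relationshipType": "CITES"}
)

# Six rows before deduplication; only the distinct directed pairs remain afterwards.
both = pd.concat([forward, backward]).drop_duplicates()
pairs = set(zip(both["sourceNodeId"], both["targetNodeId"]))
print(len(both), sorted(pairs))
```

Every remaining pair has its reverse in the result, which is what makes the directed relationships behave like an undirected graph.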

Again, let's check the first 5 rows of the new `DataFrame`:

In [None]:
relationships.head(5)

Finally, we can create the in-memory graph.

In [None]:
G = gds.alpha.graph.construct("cora-graph", nodes, relationships)
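
Before calling `construct`, a quick local check of the two `DataFrame`s can catch problems early. The helper below is hypothetical and not part of the `graphdatascience` client. It assumes the column names used in this notebook (`nodeId`, `sourceNodeId`, `targetNodeId`) are the ones that must be present, and it also flags relationships that point to node IDs missing from `nodes`.

```python
import pandas as pd


def check_construct_inputs(nodes: pd.DataFrame, relationships: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems found in the two DataFrames."""
    problems = []
    if "nodeId" not in nodes.columns:
        problems.append("nodes is missing the 'nodeId' column")
    for column in ("sourceNodeId", "targetNodeId"):
        if column not in relationships.columns:
            problems.append(f"relationships is missing the '{column}' column")
    if problems:
        return problems

    # Every relationship endpoint should refer to a node that exists.
    known_ids = set(nodes["nodeId"])
    for column in ("sourceNodeId", "targetNodeId"):
        dangling = set(relationships[column]) - known_ids
        if dangling:
            problems.append(f"{column} refers to unknown node IDs: {sorted(dangling)}")
    return problems


# Made-up data: node 3 is referenced as a target but never defined.
toy_nodes = pd.DataFrame({"nodeId": [1, 2], "labels": "Paper"})
toy_rels = pd.DataFrame({"sourceNodeId": [1, 2], "targetNodeId": [2, 3], "relationshipType": "CITES"})
print(check_construct_inputs(toy_nodes, toy_rels))
```

In this notebook the equivalent call would be `check_construct_inputs(nodes, relationships)`, where an empty list means no problems were found.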

## Use the graph

Let's check that the new graph has been created:

In [None]:
gds.graph.list()

Let's also count the nodes in the graph:

In [None]:
G.node_count()

We can stream the value of the `subject` node property for
each node in the graph, printing only the first 10.

In [None]:
gds.graph.streamNodeProperties(G, ["subject"]).head(10)

## Cleanup

In [None]:
G.drop()