# Exercise 02 - Pulling out Graph Features at Scale

Since we covered the basics of data loading in `Exercise 01`, we're going to skip that here and quickly start with a "prebuilt" graph. (Just run the first few cells below.)

This time, however, we're going to use a _bipartite_ graph:

```
(:User)-[:HAS_IP]->(:IP)
```


In [1]:
%%capture
%pip install -U graphdatascience pandas ipywidgets
%pip install https://github.com/neo4j-field/checker/releases/download/0.4.1/checker-0.4.1.tar.gz

In [2]:
import pandas as pd
import answers.checker as c

from graphdatascience import GraphDataScience

In [3]:
# Update this if you're running locally with the provided Docker instances.
USE_TLS = True
NEO4J_HOST = "nodes.neo4j.academy"
NEO4J_URI = f"neo4j{'+s' * int(USE_TLS)}://{NEO4J_HOST}:7687"
NEO4J_AUTH = ("user255", "xxxx")

In [4]:
# If you're running locally, use the following:

users = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/user.parquet")
ips = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/ip.parquet")
has_ip = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/has_ip.parquet")

# Preview of our Data

Let's take a quick look at what our nodes and relationships look like.

In [5]:
ips

Unnamed: 0,nodeId,labels
0,343650,IP
1,266242,IP
2,279425,IP
3,307641,IP
4,180641,IP
...,...,...
585850,226704,IP
585851,104882,IP
585852,176706,IP
585853,15143,IP


In [6]:
users

Unnamed: 0,nodeId,fraudMoneyTransfer,labels
0,600214,0,User
1,589898,0,User
2,585889,0,User
3,609571,0,User
4,614918,0,User
...,...,...,...
33727,611471,0,User
33728,613116,0,User
33729,603472,0,User
33730,594318,0,User


In [7]:
has_ip

Unnamed: 0,sourceNodeId,targetNodeId,relationshipType
0,600214,151725,HAS_IP
1,600214,232299,HAS_IP
2,600214,127013,HAS_IP
3,600214,41560,HAS_IP
4,600214,434917,HAS_IP
...,...,...,...
1488944,603472,140230,HAS_IP
1488945,603472,367612,HAS_IP
1488946,594318,536671,HAS_IP
1488947,588912,179487,HAS_IP
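
The three previews share a simple column convention: node frames carry `nodeId` and `labels` (plus any properties, such as `fraudMoneyTransfer`), and the relationship frame carries `sourceNodeId`, `targetNodeId`, and `relationshipType`. As a minimal sketch, the toy frames below mirror that layout and check that every relationship runs from a `User` to an `IP`. The IDs are made up and the `_demo` names are hypothetical.

```python
import pandas as pd

# Toy stand-ins with the same columns as the real frames (made-up IDs).
users_demo = pd.DataFrame(
    {"nodeId": [100, 101], "fraudMoneyTransfer": [0, 1], "labels": ["User", "User"]}
)
ips_demo = pd.DataFrame({"nodeId": [1, 2, 3], "labels": ["IP", "IP", "IP"]})
has_ip_demo = pd.DataFrame(
    {
        "sourceNodeId": [100, 100, 101],
        "targetNodeId": [1, 2, 2],
        "relationshipType": ["HAS_IP"] * 3,
    }
)

# Bipartite check: sources are always users, targets are always IPs.
sources_are_users = has_ip_demo["sourceNodeId"].isin(users_demo["nodeId"]).all()
targets_are_ips = has_ip_demo["targetNodeId"].isin(ips_demo["nodeId"]).all()
is_bipartite = bool(sources_are_users and targets_are_ips)
print(is_bipartite)
```

Running the same two `isin` checks against `users`, `ips`, and `has_ip` is a cheap sanity test before constructing the graph.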


In [8]:
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
gds.set_database(NEO4J_AUTH[0])

In [9]:
G = gds.alpha.graph.construct("Exercise-02", [ips, users], [has_ip])

## Generating some Features

Let's pretend we need to generate node embeddings representing `User`s and their relationships to other `User`s by `IP`.

---
<br><br>

### Task 1. Mutate into a Monopartite Similarity Graph
Your first step is to create a monopartite representation using **Node Similarity**.

Generate a new relationship type called `SIMILAR_BY_IP` with a relationship weight called `score`.

> You might want to increase the `concurrency` setting. This should take only a minute or two.

In [10]:
# Mutate our graph here...
gds.nodeSimilarity.mutate(G, mutateRelationshipType='SIMILAR_BY_IP', mutateProperty="score")

NodeSimilarity:   0%|          | 0/100 [00:00<?, ?%/s]

preProcessingMillis                                                       1
computeMillis                                                         96882
mutateMillis                                                            127
postProcessingMillis                                                     -1
nodesCompared                                                         33700
relationshipsWritten                                                 330132
similarityDistribution    {'p1': 0.01694916933774948, 'max': 1.000007621...
configuration             {'topK': 10, 'similarityMetric': 'JACCARD', 'b...
Name: 0, dtype: object
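
The configuration in the output above reports `'similarityMetric': 'JACCARD'`. To make that concrete, here is a small pure-Python sketch of Jaccard similarity between users based on the sets of IPs they share. The user names and IP addresses are invented for illustration, and this is only the scoring idea, not how GDS computes it at scale.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical users and the IPs each one has used.
ips_by_user = {
    "alice": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
    "bob": {"10.0.0.2", "10.0.0.3"},
    "carol": {"10.0.0.9"},
}

# Score every pair of users; keep only pairs that share at least one IP.
similar_by_ip = {}
for u, v in combinations(sorted(ips_by_user), 2):
    score = jaccard(ips_by_user[u], ips_by_user[v])
    if score > 0:
        similar_by_ip[(u, v)] = score

print(similar_by_ip)
```

Here only `alice` and `bob` end up connected, with a `score` of 2/3, because they share two of three distinct IPs. The surviving pairs play the role of the `SIMILAR_BY_IP` relationships, which link `User` nodes directly and so give a monopartite view.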

In [11]:
# Don't change this cell.
c.check_result("Ex 02", "Task 1", G=G)

🥳 Ex 02/Task 1 passed!


---
<br><br>

### Task 2. Generate the Embeddings

Now we generate the node embedding vectors! We'll use **FastRP**. 

Mutate the graph, storing the embeddings in a property called `fastRP`, and make sure to use an embedding dimension of `256`.

In [12]:
# Mutate our graph here...
gds.fastRP.mutate(G, 
                  mutateProperty="fastRP", 
                  embeddingDimension=256,
                  iterationWeights=[0.0, 1.0, 1.0])

FastRP:   0%|          | 0/100 [00:00<?, ?%/s]

nodePropertiesWritten                                               619587
mutateMillis                                                             0
nodeCount                                                           619587
preProcessingMillis                                                      0
computeMillis                                                         2197
configuration            {'jobId': '8ea19d5d-0ec3-46f0-84aa-56d08a6d4fa...
Name: 0, dtype: object
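
For intuition about what `embeddingDimension` and `iterationWeights` control, the following is a deliberately simplified NumPy sketch of the random-projection idea: start from random vectors, repeatedly average over neighbours, and add up the normalised results using one weight per step. It is an illustration under stated assumptions (a dense Gaussian projection, a tiny undirected toy graph, a hypothetical function name), not the GDS implementation, which differs in details such as how the random vectors are drawn and how degrees are scaled.

```python
import numpy as np


def fastrp_sketch(adjacency, dimension, iteration_weights, seed=42):
    """Simplified random-projection embedding (illustrative only, not GDS)."""
    adjacency = np.asarray(adjacency, dtype=float)
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]

    # Row-normalise so that one step averages over a node's neighbours.
    degree = adjacency.sum(axis=1, keepdims=True)
    transition = np.divide(
        adjacency, degree, out=np.zeros_like(adjacency), where=degree > 0
    )

    # Assumption: dense Gaussian start vectors, chosen here for brevity.
    current = rng.standard_normal((n, dimension))
    embedding = np.zeros((n, dimension))
    for weight in iteration_weights:
        current = transition @ current
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        normalised = np.divide(
            current, norms, out=np.zeros_like(current), where=norms > 0
        )
        embedding += weight * normalised
    return embedding


# Toy graph: edges 0-1 and 1-2, node 3 is isolated.
toy_adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
embeddings = fastrp_sketch(toy_adjacency, dimension=8, iteration_weights=[0.0, 1.0, 1.0])
print(embeddings.shape)
```

In this sketch, nodes 0 and 2 have the same neighbourhood and therefore receive the same vector, while the isolated node 3 stays at all zeros. A weight of `0.0` in the first position means the first propagation step contributes nothing to the final vector.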

In [13]:
# Don't change this cell.
c.check_result("Ex 02", "Task 2", G=G)

🥳 Ex 02/Task 2 passed!


---
<br><br>

### Task 3. Dump our Vectors using the Power of Apache Arrow 🏹

Now's where the magic happens! Pull back **all** the node embeddings into a single DataFrame. If you tried this with the native Python driver, you'd be twiddling your thumbs for quite some time.

> Bonus points if you know how to get the resulting DataFrame to call the embedding vector "fastRP".

Call this new DataFrame `df`.

In [14]:
%time df = gds.graph.streamNodeProperty(G, "fastRP")
df

CPU times: user 1.14 s, sys: 763 ms, total: 1.91 s
Wall time: 24.2 s


Unnamed: 0,nodeId,propertyValue
0,320000,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,320001,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,320002,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,320003,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,320004,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...
619582,319995,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
619583,319996,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
619584,319997,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
619585,319998,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
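
For the bonus, one option that relies only on pandas is to rename the `propertyValue` column after the fact. The sketch below uses a small stand-in frame (`demo`, with made-up vectors) so it runs without a database; the same `rename` call should apply to the real `df`. It also shows how the list-valued column can be stacked into a 2-D NumPy array, which is what most downstream ML tooling expects.

```python
import numpy as np
import pandas as pd

# Stand-in for the streamed result: one embedding list per node (made-up values).
demo = pd.DataFrame(
    {
        "nodeId": [0, 1, 2],
        "propertyValue": [[0.1, 0.2], [0.0, 0.0], [0.3, -0.4]],
    }
)

# Give the embedding column the name of the property it came from.
renamed = demo.rename(columns={"propertyValue": "fastRP"})

# Stack the per-row lists into a (rows, dimension) matrix.
matrix = np.array(renamed["fastRP"].tolist())
print(renamed.columns.tolist(), matrix.shape)
```

Depending on your client version, `gds.graph.streamNodeProperties` with `separate_property_columns=True` may produce a column named after the property directly; check the documentation for the version you have installed.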


In [15]:
# Don't change this cell.
c.check_result("Ex 02", "Task 3", df=df)

🥳 Ex 02/Task 3 passed!


### Aside: Ok...is Apache Arrow 🏹 even doing anything? 🤔

Proof is in the pudding 🍮, so we'll now show how long it takes to pull back a _fraction_ of the same data using the traditional Python driver to call the `gds.graph.nodeProperty.stream` stored procedure.

In [16]:
import time

from neo4j import GraphDatabase

df2 = None
t0, t1 = 0.0, 0.0
sz = 50_000  # only fetch a sample of rows through the driver
with GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH) as d:
    with d.session() as s:
        t0 = time.time()
        result = s.run("""
          CALL gds.graph.nodeProperty.stream('Exercise-02', 'fastRP')
          YIELD nodeId, propertyValue
          RETURN * LIMIT $limit;
        """, limit=sz)
        df2 = result.to_df()
        t1 = time.time()

print(f"It took {int(t1 - t0):,} seconds to return your (partial) DataFrame. 🥱")
print(f"It would probably take {int(len(df) / (sz / (t1 - t0))):,} seconds to return the total thing! 🤯")
display(df2)

It took 21 seconds to return your (partial) DataFrame. 🥱
It would probably take 267 seconds to return the total thing! 🤯


Unnamed: 0,nodeId,propertyValue
0,0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,1,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,4,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...
49995,49995,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
49996,49996,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
49997,49997,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
49998,49998,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
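
To put the two runs side by side, the short calculation below uses only the numbers printed above: the 24.2 s wall time of the Arrow-backed call and the 267 s extrapolation for the driver. Timings will vary from run to run, so treat the result as a rough ratio rather than a benchmark.

```python
arrow_seconds = 24.2            # wall time reported by %time in Task 3
driver_estimate_seconds = 267   # extrapolated total printed by the driver cell

speedup = driver_estimate_seconds / arrow_seconds
print(f"Arrow was roughly {speedup:.1f}x faster in this run.")
```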


---
<br><br>

# Cleanup!🧹

Now you can `drop()` your graph.

In [17]:
G.drop()

graphName                                                  Exercise-02
database                                                       user255
memoryUsage                                                           
sizeInBytes                                                         -1
nodeCount                                                       619587
relationshipCount                                              1819081
configuration                                                       {}
density                                                       0.000005
creationTime                       2022-10-26T12:48:14.528724000+00:00
modificationTime                   2022-10-26T12:49:55.225083000+00:00
schema               {'graphProperties': {}, 'relationships': {'HAS...
Name: 0, dtype: object