# Exercise 02 - Pulling out Graph Features at Scale

Since we covered the basics of data loading in `Exercise 01`, we're going to skip that here and quickly start with a "prebuilt" graph. (Just run the first few cells below.)

This time, however, we're going to use a _bipartite_ graph:

```
(:User)-[:HAS_IP]->(:IP)
```
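
To make "bipartite" concrete, here is a minimal sketch in plain Python. The `toy_has_ip` pairs are hypothetical stand-ins for `HAS_IP` relationships, not rows from the training dataset. It checks that the two node sets are disjoint and then finds the IPs shared by more than one user, which is the signal the later tasks build on.

```python
from collections import defaultdict

# Hypothetical toy data: (user, ip) pairs standing in for HAS_IP relationships.
toy_has_ip = [
    ("alice", "10.0.0.1"),
    ("bob", "10.0.0.1"),
    ("bob", "10.0.0.2"),
    ("carol", "10.0.0.3"),
]

toy_users = {user for user, _ in toy_has_ip}
toy_ips = {ip for _, ip in toy_has_ip}

# Bipartite: two disjoint node sets, and every relationship goes from one set to the other.
assert toy_users.isdisjoint(toy_ips)

# Group users by the IP they connect to.
users_by_ip = defaultdict(set)
for user, ip in toy_has_ip:
    users_by_ip[ip].add(user)

# Keep only the IPs used by more than one user.
shared = {ip: sorted(us) for ip, us in users_by_ip.items() if len(us) > 1}
print(shared)  # {'10.0.0.1': ['alice', 'bob']}
```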


In [None]:
%%capture
%pip install -U graphdatascience pandas ipywidgets
%pip install https://github.com/neo4j-field/checker/releases/download/0.4.1/checker-0.4.1.tar.gz

In [None]:
import pandas as pd
import answers.checker as c

from graphdatascience import GraphDataScience

In [None]:
# Update this if you're running locally with the provided Docker instances.
USE_TLS = True
NEO4J_HOST = "nodes.neo4j.academy"
NEO4J_URI = f"neo4j{'+s' * int(USE_TLS)}://{NEO4J_HOST}:7687"
NEO4J_AUTH = ("user255", "xxxx")

In [None]:
# If you're running locally, use the following:

users = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/user.parquet")
ips = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/ip.parquet")
has_ip = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/has_ip.parquet")

# Preview of our Data

Let's take a quick look at what our nodes and relationships look like.

In [None]:
ips

In [None]:
users

In [None]:
has_ip

In [None]:
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
gds.set_database(NEO4J_AUTH[0])

In [None]:
G = gds.alpha.graph.construct("Exercise-02", [ips, users], [has_ip])

## Generating some Features

Let's pretend we need to generate node embeddings representing `User`s and their relationships to other `User`s by `IP`.

---
<br><br>

### Task 1. Mutate into a Monopartite Similarity Graph
Your first step is to create a monopartite representation using **Node Similarity**.

Generate a new relationship type called `SIMILAR_BY_IP` with a relationship weight called `score`.

> You might want to increase the `concurrency` setting. This should only take about 1 minute or so.
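
To build some intuition for what the `score` will mean, here is a minimal plain-Python sketch of Jaccard similarity, which is the default metric for Node Similarity. It is not the GDS implementation, which also applies limits such as a top-K per node, and the `toy_user_ips` data is hypothetical. Two users are similar when the sets of IPs they connect to overlap.

```python
from itertools import combinations

# Hypothetical toy data: each user mapped to the set of IPs they have used.
toy_user_ips = {
    "alice": {"10.0.0.1", "10.0.0.2"},
    "bob": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
    "carol": {"10.0.0.3"},
    "dave": {"10.0.0.9"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One (user, user) -> score entry per pair that shares at least one IP.
toy_similar_by_ip = {}
for u, v in combinations(sorted(toy_user_ips), 2):
    score = jaccard(toy_user_ips[u], toy_user_ips[v])
    if score > 0:
        toy_similar_by_ip[(u, v)] = score

for pair, score in toy_similar_by_ip.items():
    print(pair, round(score, 3))
```

In this toy data only `alice`/`bob` and `bob`/`carol` end up connected. `dave` shares no IP with anyone, so he gets no similarity relationships at all.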

In [None]:
# Mutate our graph here...


In [None]:
# Don't change this cell.
c.check_result("Ex 02", "Task 1", G=G)

---
<br><br>

### Task 2. Generate the Embeddings

Now we generate the node embedding vectors! We'll use **FastRP**. 

Mutate the graph, storing the embeddings in a node property called `fastRP`, and make sure to use an embedding dimension of `256`.
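
If you're curious what a random-projection embedding does, here is a heavily simplified plain-Python sketch of the general idea behind FastRP. It is not the GDS implementation. Every node starts with a sparse random vector, vectors are repeatedly averaged over neighbors, and the normalized results are summed with per-iteration weights. The function `toy_fastrp`, its parameters, and the toy graph are all hypothetical names used for illustration only.

```python
import math
import random


def toy_fastrp(adjacency, dim=8, iteration_weights=(1.0, 1.0), seed=42):
    """Simplified random-projection embedding; illustration only."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    # Sparse random start vectors: mostly zeros, occasionally -1 or +1.
    current = {
        n: [rng.choice((-1.0, 0.0, 0.0, 0.0, 0.0, 1.0)) for _ in range(dim)]
        for n in nodes
    }
    embedding = {n: [0.0] * dim for n in nodes}
    for weight in iteration_weights:
        # Propagate: each node takes the average of its neighbors' vectors.
        propagated = {}
        for n in nodes:
            nbrs = adjacency[n]
            if nbrs:
                propagated[n] = [
                    sum(current[m][i] for m in nbrs) / len(nbrs) for i in range(dim)
                ]
            else:
                propagated[n] = [0.0] * dim
        current = propagated
        # Accumulate the L2-normalized vector, scaled by this iteration's weight.
        for n in nodes:
            norm = math.sqrt(sum(x * x for x in current[n]))
            if norm > 0:
                for i in range(dim):
                    embedding[n][i] += weight * current[n][i] / norm
    return embedding


# Hypothetical undirected toy graph given as neighbor lists.
toy_graph = {
    "alice": ["bob"],
    "bob": ["alice", "carol", "dave"],
    "carol": ["bob"],
    "dave": ["bob"],
}
toy_embeddings = toy_fastrp(toy_graph, dim=8)
print(len(toy_embeddings["alice"]))  # 8
```

Nodes with identical neighborhoods (here `alice`, `carol`, and `dave`) end up with identical vectors, which is the kind of structural signal the embeddings are meant to capture.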

In [None]:
# Mutate our graph here...


In [None]:
# Don't change this cell.
c.check_result("Ex 02", "Task 2", G=G)

---
<br><br>

### Task 3. Dump our Vectors using the Power of Apache Arrow 🏹

Now's where the magic happens! Pull back **all** the node embeddings into a single DataFrame. If you tried this with the native Python driver, you'd be twiddling your thumbs for quite some time.

> Bonus points if you know how to get the resulting DataFrame to call the embedding vector "fastRP".

Call this new DataFrame `df`.
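
As a hint for the bonus, here is one possible approach using plain pandas. It assumes the streamed result has `nodeId` and `propertyValue` columns, which are the names yielded by the Cypher procedure in the aside further down. The `toy` DataFrame is hypothetical stand-in data, not real embeddings.

```python
import pandas as pd

# Hypothetical stand-in for a streamed result with two short "embeddings".
toy = pd.DataFrame(
    {
        "nodeId": [0, 1],
        "propertyValue": [[0.1, 0.2], [0.3, 0.4]],
    }
)

# Rename the generic value column after the property it holds.
renamed = toy.rename(columns={"propertyValue": "fastRP"})
print(list(renamed.columns))  # ['nodeId', 'fastRP']
```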

In [None]:
%time df = None

In [None]:
# Don't change this cell.
c.check_result("Ex 02", "Task 3", df=df)

### Aside: Ok...is Apache Arrow 🏹 even doing anything? 🤔

Proof is in the pudding 🍮, so we'll now show how long it takes to pull back a _fraction_ of the same data using the traditional Python driver, calling the `gds.graph.nodeProperty.stream` stored procedure.

In [None]:
from neo4j import GraphDatabase
import time

df2 = None
t0, t1 = 0.0, 0.0
sz = 50_000
with GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH) as d:
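    # Use the same database as the GDS client above, so the projected graph can be found.
    with d.session(database=NEO4J_AUTH[0]) as s: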
        t0 = time.time()
        result = s.run("""
          CALL gds.graph.nodeProperty.stream('Exercise-02', 'fastRP')
          YIELD nodeId, propertyValue
          RETURN * LIMIT $limit;
        """, limit=sz)
        df2 = result.to_df()
        t1 = time.time()

print(f"It took {int(t1 - t0):,} seconds to return your (partial) DataFrame. 🥱")
print(f"It would probably take {int(len(df) / (sz / (t1 - t0))):,} seconds to return the total thing! 🤯")
display(df2)
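
The "would probably take" estimate above is a simple linear extrapolation: measure the throughput of the partial pull, then divide the total row count by it. A minimal sketch with made-up numbers follows. The helper name `estimate_total_seconds` and the figures are hypothetical, not measurements.

```python
def estimate_total_seconds(total_rows, sample_rows, sample_seconds):
    """Linearly extrapolate from a timed sample to the full result size."""
    rows_per_second = sample_rows / sample_seconds
    return total_rows / rows_per_second


# Hypothetical numbers: 50,000 rows took 10 seconds, and there are 1,000,000 rows in total.
print(estimate_total_seconds(1_000_000, 50_000, 10.0))  # 200.0
```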

---
<br><br>

# Cleanup!🧹

Now you can `drop()` your graph.

In [None]:
G.drop()