# Exercise 01 - Projecting Parquet Files

In this exercise, we're going to create a very simple _monopartite_ graph (one node label, one relationship type) like:

```
(:User)-[:REFERRED]->(:User)
```

We have 2 input files (for local users):
- `~/input/user.parquet` -- our users
- `~/input/referred.parquet` -- our relationships

For non-local users wanting to pull from the internet:
- `https://storage.googleapis.com/neo4j-se-public/training/user.parquet`
- `https://storage.googleapis.com/neo4j-se-public/training/referred.parquet`
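
Since `pd.read_parquet` accepts either a local path or a URL, one way to support both setups is a small helper that prefers the local file and falls back to the public bucket. This is only a sketch: `parquet_source` and `BASE_URL` are hypothetical names, not part of the exercise or the checker.

```python
from pathlib import Path

# Public location of the training files (taken from the list above).
BASE_URL = "https://storage.googleapis.com/neo4j-se-public/training"


def parquet_source(name: str, local_dir: str = "~/input") -> str:
    """Return the local path if the file exists, otherwise its public URL."""
    local = Path(local_dir).expanduser() / name
    return str(local) if local.is_file() else f"{BASE_URL}/{name}"


print(parquet_source("user.parquet"))
```

The returned string can be passed straight to `pd.read_parquet`.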

In [7]:
%%capture
%pip install -U graphdatascience pandas ipywidgets
%pip install https://github.com/neo4j-field/checker/releases/download/0.4.0/checker-0.4.0.tar.gz

In [8]:
import pandas as pd
import answers.checker as c

from graphdatascience import GraphDataScience

In [9]:
# Update this if you're not running locally with the provided Docker instances.
USE_TLS = False
NEO4J_HOST = "neo4j.arrow"
NEO4J_URI = f"neo4j{'+s' * int(USE_TLS)}://{NEO4J_HOST}:7687"
NEO4J_AUTH = ("neo4j", "password")
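
The URI expression above relies on `int(USE_TLS)` being `0` or `1`, so `'+s'` is repeated zero times or once, giving either the `neo4j://` or the `neo4j+s://` scheme. A more explicit equivalent, as a sketch with a hypothetical `build_uri` helper that is not part of the exercise:

```python
def build_uri(host: str, use_tls: bool, port: int = 7687) -> str:
    """Build a Neo4j URI, switching to the encrypted scheme when use_tls is set."""
    scheme = "neo4j+s" if use_tls else "neo4j"
    return f"{scheme}://{host}:{port}"


print(build_uri("neo4j.arrow", use_tls=False))  # neo4j://neo4j.arrow:7687
print(build_uri("neo4j.arrow", use_tls=True))   # neo4j+s://neo4j.arrow:7687
```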

---
<br><br>

## Task 1: Initialize the GDS Client

We need a client talking to our `NEO4J_HOST`, so let's initialize a connection.

Make sure the client is referenced by a variable named `gds`.

> Don't forget to `set_database()` to "neo4j"!

In [10]:
gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)
gds.set_database("neo4j")

### Task 1: Check Your Work

In [11]:
# Don't change this cell.
c.check_result("Ex 01", "Task 1", gds=gds)

🥳 Ex 01/Task 1 passed!


---
<br><br>

## Task 2: Collect Our Nodes

Let's first load our `User` nodes. They're in a single Parquet file, so it's easy to read it into a Pandas DataFrame using the `read_parquet` function.

> See the docs: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

In [12]:
users = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/user.parquet")

### Task 2: Check Your Work

In [13]:
# Don't change this cell.
c.check_result("Ex 01", "Task 2", users=users)
users

🥳 Ex 01/Task 2 passed!


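| | nodeId | fraudMoneyTransfer | labels |
|---|---|---|---|
| 0 | 600214 | 0 | User |
| 1 | 589898 | 0 | User |
| 2 | 585889 | 0 | User |
| 3 | 609571 | 0 | User |
| 4 | 614918 | 0 | User |
| ... | ... | ... | ... |
| 33727 | 611471 | 0 | User |
| 33728 | 613116 | 0 | User |
| 33729 | 603472 | 0 | User |
| 33730 | 594318 | 0 | User |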


---
<br><br>

## Task 3: Collect Your Relationships
This should be easy now that you've learned how to load your nodes.

> Remember: the relationships are in `~/input/referred.parquet` (or at the `referred.parquet` URL listed at the top if you're not running locally).

In [14]:
referred = pd.read_parquet("https://storage.googleapis.com/neo4j-se-public/training/referred.parquet")

### Task 3: Check Your Work

In [15]:
c.check_result("Ex 01", "Task 3", referred=referred)
referred

🥳 Ex 01/Task 3 passed!


| | sourceNodeId | targetNodeId | relationshipType |
|---|---|---|---|
| 0 | 589819 | 610510 | REFERRED |
| 1 | 594910 | 612799 | REFERRED |
| 2 | 588652 | 597869 | REFERRED |
| 3 | 606619 | 591810 | REFERRED |
| 4 | 601680 | 611058 | REFERRED |
| ... | ... | ... | ... |
| 1865 | 601597 | 604422 | REFERRED |
| 1866 | 598670 | 586725 | REFERRED |
| 1867 | 615396 | 604042 | REFERRED |
| 1868 | 586685 | 616031 | REFERRED |


---
<br>
<br>

## Task 4: Project the Graph
Now let's actually project our graph!

Take the node and relationship dataframes you've created and use `gds.alpha.graph.construct` to project a graph `G` named `"Exercise-01"`.

See https://neo4j.com/docs/graph-data-science-client/current/graph-object/#_constructing_a_graph
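
Before constructing, it can be worth checking that the DataFrames have the shape `construct` expects: a `nodeId` column for nodes, and `sourceNodeId`/`targetNodeId` columns for relationships, with every endpoint present among the nodes. The following is a rough sketch on toy data; `dangling_relationships` is a hypothetical helper, not part of the client or the checker.

```python
import pandas as pd


def dangling_relationships(nodes: pd.DataFrame, rels: pd.DataFrame) -> pd.DataFrame:
    """Return the relationship rows whose source or target is not a known nodeId."""
    known = set(nodes["nodeId"])
    ok = rels["sourceNodeId"].isin(known) & rels["targetNodeId"].isin(known)
    return rels[~ok]


# Toy data in the same column layout as the exercise files.
nodes = pd.DataFrame({"nodeId": [1, 2, 3], "labels": ["User"] * 3})
rels = pd.DataFrame({
    "sourceNodeId": [1, 2, 3],
    "targetNodeId": [2, 3, 99],  # 99 is not a node
    "relationshipType": ["REFERRED"] * 3,
})

bad = dangling_relationships(nodes, rels)
print(len(bad))  # 1
```

On the exercise data the same call would be `dangling_relationships(users, referred)`, where an empty result is what you would hope to see.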


In [16]:
G = gds.alpha.graph.construct("Exercise-01", users, referred)

In [17]:
c.check_result("Ex 01", "Task 4", G=G)
G.node_count(), G.relationship_count(), G.node_labels(), G.relationship_types()

🥳 Ex 01/Task 4 passed!


(33732, 1870, ['User'], ['REFERRED'])

---
<br>
<br>

## Task 5: Run WCC

Now, to make sure you really loaded the data correctly, let's run Weakly Connected Components (WCC, my favorite algo) and find the number of unique components.

Make sure to store the results in an object called `wcc_components`.

> Hint: you can select the series from a DataFrame and run `.unique()` on it. See https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-specific-columns-from-a-dataframe
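
For intuition about what the count means: a weakly connected component is a set of nodes that can reach one another when relationship direction is ignored, so a node with no relationships is a component of its own. The sketch below counts components on a toy edge list with a simple union-find. It illustrates the idea only and is not how GDS implements WCC; `count_components` is a hypothetical name.

```python
def count_components(node_ids, edges):
    """Count weakly connected components using union-find (direction ignored)."""
    parent = {n: n for n in node_ids}

    def find(n):
        # Walk up to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for source, target in edges:
        # Merge the two components that this relationship connects.
        parent[find(source)] = find(target)

    return len({find(n) for n in node_ids})


# Six users: one chain of three, one pair, and one isolated user.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
print(count_components(nodes, edges))  # 3
```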

In [18]:
wcc = gds.wcc.stream(G)
wcc_components = len(wcc["componentId"].unique())

In [19]:
c.check_result("Ex 01", "Task 5", wcc_components=wcc_components)
f"You found {wcc_components} components!"

🥳 Ex 01/Task 5 passed!


'You found 31909 components!'

---
<br>
<br>

## Bonus Task: Find the Top 10 Largest Components without Cypher
As an extra task, to test your Pandas aptitude, can you find the component ids of the _top 10 largest components_ without Cypher or persisting to the database?

> Assign this DataFrame to a variable named `top10`.
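
One possible approach, sketched on toy data: count the rows per `componentId`, then keep the largest counts. The toy frame below only mimics the `nodeId`/`componentId` columns of a WCC stream result, and the names `toy_wcc` and `top2` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for a WCC stream result: one row per node.
toy_wcc = pd.DataFrame({
    "nodeId": [10, 11, 12, 13, 14, 15],
    "componentId": [0, 0, 0, 1, 1, 2],
})

top2 = (
    toy_wcc
    .groupby("componentId")
    .size()                        # number of nodes in each component
    .rename("cnt")
    .sort_values(ascending=False)
    .head(2)                       # keep the two largest components
    .to_frame()
)
print(top2)
```

The result is a DataFrame indexed by `componentId` with a single `cnt` column.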

In [20]:
top10 = (
    wcc
    .groupby("componentId")
    .count()
    .rename(columns={"nodeId": "cnt"})
    .sort_values(by="cnt", ascending=False)[:10]
)

In [21]:
c.check_result("Ex 01", "Bonus", top10=top10)
top10

🥳 Ex 01/Bonus passed!


| componentId | cnt |
|---|---|
| 4882 | 7 |
| 2410 | 7 |
| 6907 | 7 |
| 4978 | 7 |
| 2384 | 6 |
| 6941 | 5 |
| 2031 | 5 |
| 4111 | 5 |
| 2312 | 5 |
| 8989 | 5 |


---
<br><br>

# Cleanup!🧹

Now you can `drop()` your graph to remove the projection from the GDS graph catalog and free the memory it uses.

In [22]:
G.drop()

graphName                                                  Exercise-01
database                                                         neo4j
memoryUsage                                                           
sizeInBytes                                                         -1
nodeCount                                                        33732
relationshipCount                                                 1870
configuration                                                       {}
density                                                       0.000002
creationTime                       2022-10-25T17:58:34.330453946+00:00
modificationTime                   2022-10-25T17:58:34.330401317+00:00
schema               {'graphProperties': {}, 'relationships': {'REF...
Name: 0, dtype: object