# Exercise 01 - Projecting Parquet Files

In this exercise, we're going to create a super simple _monopartite_ graph (a single node label and a single relationship type) like:

```
(:User)-[:REFERRED]->(:User)
```

We have 2 input files (for local users):
- `~/input/user.parquet` -- our users
- `~/input/referred.parquet` -- our relationships

For non-local users wanting to pull from the internet:
- `https://storage.googleapis.com/neo4j-se-public/training/user.parquet`
- `https://storage.googleapis.com/neo4j-se-public/training/referred.parquet`

In [None]:
%%capture
%pip install -U graphdatascience pandas ipywidgets
%pip install https://github.com/neo4j-field/checker/releases/download/0.4.1/checker-0.4.1.tar.gz

In [None]:
import pandas as pd
import answers.checker as c

from graphdatascience import GraphDataScience

In [None]:
# Update this if you're running locally with the provided Docker instances.
USE_TLS = True
NEO4J_HOST = "nodes.neo4j.academy"
NEO4J_URI = f"neo4j{'+s' * int(USE_TLS)}://{NEO4J_HOST}:7687"
NEO4J_AUTH = ("user255", "xxxx")
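
The `'+s' * int(USE_TLS)` trick above relies on string repetition: multiplying by `1` keeps the `+s` suffix and multiplying by `0` drops it. Here is a minimal sketch of the same idea as a helper. The name `build_uri` and the `localhost` host are illustrative assumptions, not part of the exercise setup.

```python
def build_uri(host: str, use_tls: bool, port: int = 7687) -> str:
    # int(True) == 1 and int(False) == 0, so "+s" appears once or not at all.
    scheme = "neo4j" + "+s" * int(use_tls)
    return f"{scheme}://{host}:{port}"


# Hosted instance with TLS, as configured in the cell above.
remote_uri = build_uri("nodes.neo4j.academy", use_tls=True)

# Hypothetical local Docker instance without TLS; the host name is an assumption.
local_uri = build_uri("localhost", use_tls=False)

print(remote_uri)
print(local_uri)
```

If you run locally, you would typically flip `USE_TLS` to `False` and point `NEO4J_HOST` at wherever your Docker instance listens.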

---
<br><br>

## Task 1: Initialize the GDS Client

We need a client talking to our `NEO4J_HOST`, so let's initialize a connection.

Make sure the client is referenced by a variable named `gds`.

> Don't forget to `set_database()`!

In [None]:
gds = None

### Task 1: Check Your Work

In [None]:
# Don't change this cell.
c.check_result("Ex 01", "Task 1", gds=gds)

---
<br><br>

## Task 2: Collect Our Nodes

Let's first load our `User` nodes. They're in a single Parquet file, so it's easy to read it into a Pandas DataFrame using the `read_parquet` function.

> See the docs: https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html

In [None]:
users = None

### Task 2: Check Your Work

In [None]:
# Don't change this cell.
c.check_result("Ex 01", "Task 2", users=users)
users

---
<br><br>

## Task 3: Collect Your Relationships
This should be easy now that you've learned how to load your nodes.

> Remember: the relationships are in `~/input/referred.parquet` (local) or `https://storage.googleapis.com/neo4j-se-public/training/referred.parquet` (remote).

In [None]:
referred = None

### Task 3: Check Your Work

In [None]:
c.check_result("Ex 01", "Task 3", referred=referred)
referred

---
<br>
<br>

## Task 4: Project the Graph
Now let's actually project our graph!

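Take the node and relationship DataFrames you've created and use `gds.alpha.graph.construct` to project a graph `G` named `"Exercise-01"`. (Recent releases of the client also appear to expose this as `gds.graph.construct`; check the docs for the version you installed.)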

See https://neo4j.com/docs/graph-data-science-client/current/graph-object/#_constructing_a_graph
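
Before calling the construct procedure, it can help to check that your DataFrames have the shape the client expects. The sketch below uses toy data, not the exercise files. The column names (`nodeId`, `labels`, `sourceNodeId`, `targetNodeId`, `relationshipType`) follow the convention described in the linked docs as best understood here, so verify them there. The helper `missing_columns` is hypothetical.

```python
import pandas as pd

# Toy data; these ids are made up and are not taken from the exercise files.
nodes = pd.DataFrame({
    "nodeId": [0, 1, 2],
    "labels": ["User", "User", "User"],
})
relationships = pd.DataFrame({
    "sourceNodeId": [0, 1],
    "targetNodeId": [1, 2],
    "relationshipType": ["REFERRED", "REFERRED"],
})


def missing_columns(df: pd.DataFrame, required: list) -> list:
    # Hypothetical helper: list the required columns that the frame lacks.
    return sorted(set(required) - set(df.columns))


node_gaps = missing_columns(nodes, ["nodeId"])
rel_gaps = missing_columns(relationships, ["sourceNodeId", "targetNodeId"])
print(node_gaps, rel_gaps)

# With a live client, the projection call would then look something like:
# G = gds.alpha.graph.construct("toy-graph", nodes, relationships)
```

If either list comes back non-empty, renaming columns with `DataFrame.rename(columns={...})` is usually the quickest fix.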


In [None]:
G = None

In [None]:
c.check_result("Ex 01", "Task 4", G=G)
G.node_count(), G.relationship_count(), G.node_labels(), G.relationship_types()

---
<br>
<br>

## Task 5: Run WCC

Now, to make sure you really loaded the data correctly, let's run Weakly Connected Components, or WCC (my favorite algo), and find the number of unique components.

Make sure to store the results in an object called `wcc_components`.

> Hint: you can select the series from a DataFrame and run `.unique()` on it. See https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#how-do-i-select-specific-columns-from-a-dataframe
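
To illustrate the hint, here is a small sketch on toy data. It assumes a streamed WCC result is a DataFrame with `nodeId` and `componentId` columns. The values below are made up and are not the exercise answer.

```python
import pandas as pd

# Toy stand-in for a streamed WCC result; the values are made up.
toy_wcc = pd.DataFrame({
    "nodeId": [0, 1, 2, 3, 4, 5],
    "componentId": [0, 0, 0, 3, 3, 5],
})

# Select a single column as a Series, then collect its distinct values.
component_ids = toy_wcc["componentId"].unique()
n_components = len(component_ids)
print(n_components)
```

`Series.nunique()` gives the same count in one step if you don't need the ids themselves.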

In [None]:
wcc = None
wcc_components = None

In [None]:
c.check_result("Ex 01", "Task 5", wcc_components=wcc_components)
f"You found {wcc_components} components!"

---
<br>
<br>

## Bonus Task: Find the Top 10 Largest Components without Cypher
As an extra task to test your Pandas aptitude, can you find the component ids of the _top 10 largest components_ without Cypher or persisting to the database?

> Assign the resulting DataFrame to a variable named `top10`.
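
One possible approach, sketched on toy data: group by component, count the members, and keep the largest groups. The column names and the `size` label are assumptions, and the exact shape the checker expects for `top10` isn't stated here, so treat this as a starting point.

```python
import pandas as pd

# Toy stand-in for a streamed WCC result; the values are made up.
toy_wcc = pd.DataFrame({
    "nodeId": list(range(8)),
    "componentId": [0, 0, 0, 0, 4, 4, 6, 7],
})

# Count the nodes per component, then keep the largest groups.
sizes = (
    toy_wcc.groupby("componentId")
    .size()
    .reset_index(name="size")
)
top2 = sizes.nlargest(2, "size")  # use 10 instead of 2 on the real data
print(top2)
```

`Series.value_counts()` is another route to the same counts, since it sorts by frequency in descending order by default.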

In [None]:
top10 = None

In [None]:
c.check_result("Ex 01", "Bonus", top10=top10)
top10

---
<br><br>

# Cleanup!🧹

Now you can `drop()` your graph.

In [None]:
G.drop()