# Running Graph Analytics on Large Scale Graphs Effortlessly with Nvidia and Memgraph 

This tutorial will show you how to use **PageRank** graph analysis and **Louvain** community detection on a **Facebook dataset** containing 1.3 million relationships. Upon completing it, you will know how to run analytics algorithms on your dataset using **Python**. The same approach then works for any of the following algorithms:
* Balanced Cut (clustering)
* Spectral Clustering (clustering)
* HITS (hubs vs. authorities analytics)
* Leiden (community detection)
* Katz Centrality
* Betweenness Centrality

All of the algorithms above are powered by **[Nvidia cuGraph](https://rapids.ai/)** and they will execute on **GPU**.



## Prerequisites

To follow the tutorial, please install:
- [Docker](https://docs.docker.com/get-docker/) - needed to run the `memgraph/memgraph-mage:1.3-cugraph-22.02-cuda-11.5` image we use
- [Jupyter](https://jupyter.org/install) - a notebook lets you keep the `CSV` import and the graph analytics in one file
- [GQLAlchemy](https://pypi.org/project/gqlalchemy/) - used to connect Memgraph with Python
- [Memgraph Lab](https://memgraph.com/lab) - a GUI tool we use to visualize graphs

Here are brief instructions on how to install everything you will need.

### Memgraph with Docker
We need *[Docker](https://www.docker.com/)* because Memgraph is a native Linux application and can't be installed natively on Windows or macOS. In this tutorial, we will use the `memgraph/memgraph-mage:1.3-cugraph-22.02-cuda-11.5` Docker image. Check our [guide](https://memgraph.com/docs/mage/installation/cugraph) on setting everything up. Note that all `cuGraph` Docker images are available [here](https://hub.docker.com/r/memgraph/memgraph-mage/tags?page=1&name=cugraph). Be sure to download the image that matches the CUDA drivers on your machine. You can check the compatibility between drivers and CUDA versions on the official [NVIDIA page](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).


Before running the `memgraph/memgraph-mage:1.3-cugraph-22.02-cuda-11.5` image, position yourself inside the `jupyter-memgraph-tutorials/cugraph-analytics` folder with the terminal. When you get there, we’ll show you a nifty "hack" with the docker run command.


All the data we need is inside `.csv` files, and Memgraph needs to have access to those files. Because we will run Memgraph within a Docker container and the files are currently on our machine, we need to make them available inside the container where Memgraph will be running. So let's mount our local `data/facebook_clean_data/` folder as a volume at `/samples` inside the Docker container. The `.csv` files will then be located in the `/samples` folder within the container, where *Memgraph* will find them when needed.

Now we can start the Docker image:

```
docker run -it -p 7687:7687 -p 7444:7444 --volume ./data/facebook_clean_data/:/samples memgraph/memgraph-mage:1.3-cugraph-22.02-cuda-11.5
```
If successful, you should see a message similar to the following:
```
You are running Memgraph vX.X.X
To get started with Memgraph, visit https://memgr.ph/start
```

### Jupyter notebook
With Memgraph running, let’s install **Jupyter**. We used **jupyter-lab**, which you can install as follows:
```
pip install jupyterlab
```

As mentioned before, the `.csv` files holding the dataset we will use in the tutorial are located in the repository folder `cugraph-analytics/data`.

### GQLAlchemy installation
**[GQLAlchemy](https://memgraph.com/docs/gqlalchemy/)** is an object graph mapper (OGM) used to connect to Memgraph and execute Cypher queries using **Python**. You can think of Cypher as SQL for graph databases. It has many comparable language constructs, such as `CREATE`, `DELETE`, and `SET` for updates.

Go to the GQLAlchemy [installation](https://memgraph.com/docs/gqlalchemy/installation) page for installation instructions and more information. If you have `CMake`, the easiest way to install `GQLAlchemy` is with `pip`:

```
pip install gqlalchemy
```

### Memgraph Lab installation
The last piece of tech you will need is **[Memgraph Lab](https://memgraph.com/lab)**. You will use it to connect to **Memgraph** and create data visualizations. Check out how to install **Memgraph Lab** as a desktop application for [Windows](https://memgraph.com/docs/memgraph-lab/installation/windows#step-2---installing-and-setting-up-memgraph-lab), [Mac](https://memgraph.com/docs/memgraph-lab/installation/macos#step-2---installing-and-setting-up-memgraph-lab) or [Linux](https://memgraph.com/docs/memgraph-lab/installation/linux#step-2---installing-and-setting-up-memgraph-lab), and then [connect to Memgraph database](https://memgraph.com/docs/memgraph-lab/connect-to-memgraph#connecting-to-memgraph).

With the **Memgraph Lab** installed and connected, we are ready to connect to Memgraph with **GQLAlchemy**, import our large dataset and run graph analytics using **Python**.

## Connecting to Memgraph with GQLAlchemy

The next three lines of code will import `gqlalchemy`, connect to the **Memgraph** database instance at host `127.0.0.1` and port `7687`, and clear the database, just to be sure we are starting with a clean slate.

```python
from gqlalchemy import Memgraph

memgraph = Memgraph("127.0.0.1", 7687)
memgraph.drop_database()
```
Let's import the dataset from `.csv` files and learn how to perform **PageRank** and **Louvain community detection** using **Python**.

## Importing data

The `.csv` files containing the [**Facebook** dataset](https://snap.stanford.edu/data/gemsec-Facebook.html) have the following structure:
```
node_1,node_2
0,1794
0,3102
0,16645
```
The dataset consists of verified Facebook pages belonging to various categories and dating back to November 2017. Each node stands for a page, and relationships represent mutual likes. The nodes are reindexed (starting from 0) to achieve a certain level of anonymity. Because Memgraph imports indexed data faster, we will create an index for `Page` nodes over the `id` property.

```python
memgraph.execute(
    """
    CREATE INDEX ON :Page(id);
    """
)
```
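Before importing, it can help to sanity-check a file locally. The following is a minimal sketch and not part of the original notebook: `summarize_edge_list` is a hypothetical helper that assumes the two-column `node_1,node_2` layout shown above, and it simply counts distinct node ids and rows.

```python
import csv
import io


def summarize_edge_list(csv_text):
    """Return (distinct_node_count, row_count) for a node_1,node_2 edge list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    nodes = set()
    rows = 0
    for row in reader:
        # Both endpoints of every row count as nodes
        nodes.update((row["node_1"], row["node_2"]))
        rows += 1
    return len(nodes), rows


# The sample rows from the snippet above
sample = "node_1,node_2\n0,1794\n0,3102\n0,16645\n"
print(summarize_edge_list(sample))  # (4, 3)
```

For a real file, you could pass in the text read from one of the files in `./data/facebook_clean_data/`.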

Now, to make full use of our "hack" from before, let's list our local files in the `./data/facebook_clean_data/` folder and build their container paths by prefixing each file name with the `/samples/` folder. Those paths will then represent the paths to the `.csv` files in the Docker container.

```python
from os import listdir
from os.path import abspath, isfile, join

# Local folder that is mounted at /samples inside the Docker container
csv_dir_path = abspath("./data/facebook_clean_data/")

# Paths to the same files as seen from inside the container
csv_files = [f"/samples/{f}" for f in listdir(csv_dir_path) if isfile(join(csv_dir_path, f))]
```

Once we have all the `.csv` files, we can load them with the following query:

```python
for csv_file_path in csv_files:
    memgraph.execute(
        f"""
        LOAD CSV FROM "{csv_file_path}" WITH HEADER AS row
        MERGE (p1:Page {{id: row.node_1}})
        MERGE (p2:Page {{id: row.node_2}})
        MERGE (p1)-[:LIKES]->(p2);
        """
    )
```

You can find out more about the `LOAD CSV` clause for importing `.csv` files in our [docs](https://memgraph.com/docs/memgraph/import-data/load-csv-clause).
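A quick way to confirm that the import worked is to count what landed in the database. The helper below is a sketch rather than part of the original notebook: `graph_size` is a hypothetical name, and it assumes only that the object passed in exposes `execute_and_fetch` and yields dictionaries, as the `memgraph` connection used in this tutorial does.

```python
def graph_size(db):
    """Return (node_count, relationship_count) for :Page nodes and :LIKES relationships."""
    node_row = next(iter(db.execute_and_fetch(
        "MATCH (p:Page) RETURN count(p) AS cnt;"
    )))
    rel_row = next(iter(db.execute_and_fetch(
        "MATCH (:Page)-[r:LIKES]->(:Page) RETURN count(r) AS cnt;"
    )))
    return node_row["cnt"], rel_row["cnt"]


# Usage with the connection created earlier (requires a running Memgraph instance):
# nodes, relationships = graph_size(memgraph)
# print(f"{nodes} pages, {relationships} LIKES relationships")
```

If both numbers are zero, the most likely cause is that the container cannot see the files under `/samples`.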

We are all set to use PageRank and Louvain community detection algorithms with Python
to find out which pages in our network are most important and to find all the communities
we have in a network.

## PageRank importance analysis
Now, let's execute PageRank to find the important pages of the Facebook dataset. To read more about how **PageRank** works, go to our **[docs](https://memgraph.com/docs/mage/query-modules/cpp/pagerank)** page. All algorithms mentioned in the [introduction](#introduction) were developed by **[Nvidia](https://rapids.ai/)**, and they come integrated within **Memgraph MAGE**. Our goal at **Memgraph** is to make it easy for you to use fast algorithms on graph databases.
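For intuition about what the algorithm computes, here is a minimal plain-Python sketch of PageRank by power iteration on a toy graph. It is an illustration only and not how cuGraph implements it on the GPU; the function name `pagerank`, its parameters `damping` and `iterations`, and the toy edge list are all assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank scores for a directed edge list by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if out_degree[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its rank along each outgoing edge
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Toy graph: pages "a" and "b" both like "c", and "c" likes "a"
scores = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
for page, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

In this toy graph, `c` ends up with the highest score because two pages point to it, and `a` comes second because it is liked by the highly ranked `c`.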

The graph algorithms in MAGE are implemented in C++ or Python, and include a large selection ranging from **graph neural networks** to various centrality measures. Find out more at the [MAGE docs page](https://memgraph.com/docs/mage/algorithms) and in our [tutorials](https://memgraph.com/categories/tutorials) on how to use such analytics to classify nodes, predict relationships, and much more! Everything in MAGE is integrated so that running PageRank is quick and easy. The following query executes the algorithm and stores the computed PageRank scores in the `rank` property. It takes only ~4 seconds for our graph of more than 1 million edges!


```python
memgraph.execute(
    """
    CALL cugraph.pagerank.get() YIELD node, rank
    SET node.rank = rank;
    """
)
```

Now, ranks are ready and you can retrieve them with the following Python call:

```python
results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    RETURN n.id AS node, n.rank AS rank
    ORDER BY rank DESC
    LIMIT 10;
    """
)
for dict_result in results:
    print(f"node id: {dict_result['node']}, rank: {dict_result['rank']}")
```

With the above code, we got the 10 nodes with the highest rank score, each returned as a Python dictionary. Time to visualize the results with **[Memgraph Lab](https://memgraph.com/lab)**!

Open `Execute Query` view in **Memgraph Lab** and run the following query:
```
MATCH (n)
WITH n
ORDER BY n.rank DESC
LIMIT 3
MATCH (n)<-[e]-(m)
RETURN *;
```

The first part of this query `MATCH`es all the nodes, orders them by their `rank` in descending order, and keeps the top `3`. The `WITH` clause passes those three nodes on to the second part, which `MATCH`es every page that has a relationship pointing to them.


Besides creating beautiful visualizations powered by [D3.js](https://d3js.org/) and our [graph style script](https://memgraph.com/docs/memgraph-lab/graph-style-script-language), you can use Memgraph Lab to [query your graph database](https://memgraph.com/docs/memgraph-lab/connect-to-memgraph#executing-queries) and to:
 * write graph algorithms in **Python**, **C++**, or even **Rust**
 * check Memgraph Database Logs
 * visualize graph schema
 * import/export dataset

If you don't have your own dataset at hand, there are plenty of datasets available in Memgraph Lab that you can explore. Everything you might need to know about Memgraph Lab can be found in our [docs](https://memgraph.com/docs/memgraph-lab/).

That's it for PageRank. Next, you will see how to use Louvain community detection to find communities in the graph.

## Community detection with Louvain
The Louvain algorithm optimizes modularity, a measure of how densely the nodes within a community are connected compared to how connected they would be in a random network. It repeatedly merges each community into a single node and runs the modularity clustering again on the condensed graph. It is one of the most popular community detection algorithms. Let's run it to find out how many communities the graph contains. First, we run Louvain and save the `cluster_id` as a property on every node.

```python
memgraph.execute(
    """
    CALL cugraph.louvain.get() YIELD cluster_id, node
    SET node.cluster_id = cluster_id;
    """
)
```
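To make the quantity behind Louvain more concrete, the sketch below computes the modularity of a given partition in plain Python. It is an illustration under stated assumptions, not cuGraph's implementation: the graph is treated as undirected and unweighted, and the function name `modularity` and the toy graph (two triangles joined by one bridge edge) are made up for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q = sum over communities c of L_c / m - (d_c / (2 m)) ** 2.

    m is the number of edges, L_c the number of edges inside community c,
    and d_c the sum of the degrees of the nodes in community c.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles (0-1-2 and 3-4-5) joined by the bridge edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {node: "A" for node in range(6)}

print(round(modularity(edges, two_communities), 3))  # 0.357
print(round(modularity(edges, one_community), 3))    # 0.0
```

Splitting the graph along the bridge scores higher than lumping everything together, which is the kind of partition Louvain looks for.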

Next, we count the distinct `cluster_id` values to see how many communities were found:

```python
results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    WITH DISTINCT n.cluster_id AS cluster_id
    RETURN count(cluster_id) AS num_of_clusters;
    """
)
# The query returns a single row
result = list(results)[0]

# Each row is a dict keyed by the returned column names
print(f"Number of clusters: {result['num_of_clusters']}")
```
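Beyond the number of communities, their sizes are often the next thing to look at. The following is a small sketch, not part of the original notebook: `community_sizes` is a hypothetical helper that assumes rows shaped like the dictionaries returned by `execute_and_fetch`, each with a `cluster_id` key.

```python
from collections import Counter


def community_sizes(rows):
    """Count nodes per community from rows shaped like {'cluster_id': ...}."""
    return Counter(row["cluster_id"] for row in rows)


# Usage with the connection from earlier (requires a running Memgraph instance):
# rows = memgraph.execute_and_fetch("MATCH (n) RETURN n.cluster_id AS cluster_id;")
# for cluster_id, size in community_sizes(rows).most_common(5):
#     print(f"cluster {cluster_id}: {size} nodes")
```

The same aggregation could be done directly in Cypher. Doing it in Python keeps the raw counts around for further analysis in the notebook.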


You can now dive in and explore the graph's community structure. For example, let's find *border nodes*, which belong to one community but are connected to nodes in other communities. Because Louvain favors densely connected communities, we shouldn't see very many. Run the following query in Memgraph Lab:

```
MATCH  (n2)<-[e1]-(n1)-[e]->(m1)
WHERE n1.cluster_id != m1.cluster_id AND n1.cluster_id = n2.cluster_id
RETURN *
LIMIT 1000;
```
This query `MATCH`es a node `n1` together with two of its outgoing relationships:
`(n2)<-[e1]-(n1)` and `(n1)-[e]->(m1)`. The `WHERE` clause then keeps only the
matches in which `n1` and `n2` share a `cluster_id` while `m1` has a different one.
Using `LIMIT 1000`, we limit the results to show at most 1000 such matches for easier visualization.
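The same condition can be expressed in a few lines of plain Python, which may help when reading the query. This is a sketch for illustration only: `border_nodes` is a hypothetical helper, and the edge list and cluster assignment are made-up toy data rather than results from the Facebook dataset.

```python
def border_nodes(edges, cluster_of):
    """Nodes with an outgoing edge into their own cluster and one into another cluster."""
    links_inside = set()
    links_outside = set()
    for src, dst in edges:
        if cluster_of[src] == cluster_of[dst]:
            links_inside.add(src)
        else:
            links_outside.add(src)
    # Mirrors the WHERE clause: same cluster as n2, different cluster from m1
    return links_inside & links_outside


toy_edges = [(1, 2), (1, 3), (2, 1), (3, 4)]
toy_clusters = {1: 0, 2: 0, 3: 1, 4: 1}
print(border_nodes(toy_edges, toy_clusters))  # {1}
```

Only node `1` qualifies here, because it links to node `2` in its own cluster and to node `3` in the other one.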

## Where to next?
And there you have it - more than a million relationships imported and two graph analytics algorithms used.
Now you can import huge graphs and do the analytics you want in a matter of seconds.
If you like what we do, don't hesitate to give us a star on **[Memgraph](https://github.com/memgraph/memgraph)** and **[Memgraph MAGE](https://github.com/memgraph/mage)**, and don't forget to give a star to the devs of **[RAPIDS cuGraph](https://github.com/rapidsai/cugraph)**.

If you have any questions, feel free to ask us Memgraphers on **[Discord](https://discord.gg/memgraph)**.


Onwards and upwards!