<a href="https://colab.research.google.com/github/dcolinmorgan/grph/blob/main/simple_GFQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple GFQL demo on Twitter data

* Twitter network with 81,306 nodes and 2,420,766 edges

* The single-threaded CPU mode benefits from GFQL's novel dataframe engine, and the GPU mode adds single-GPU acceleration on top of it. Both the `chain()` and `hop()` methods are benchmarked.

* The benchmark does not cover bigger-than-memory or distributed scenarios. The results shown here were obtained on a free Google Colab T4 runtime with a 2.2 GHz Intel CPU (12 GB CPU RAM) and an Nvidia T4 GPU (16 GB GPU RAM).
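
The cells below all follow the same timing pattern: run a query 10 times and record the total wall-clock time. As a minimal sketch of that pattern, the helper `time_repeated` below is a hypothetical name (it is not part of graphistry), and the two workloads are stand-ins for the CPU and GPU queries so the snippet runs without a GPU.

```python
import time

def time_repeated(fn, repeats=10):
    """Total wall-clock seconds for `repeats` calls of `fn` (hypothetical helper)."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start

# Stand-in workloads; in this notebook they would be the CPU and GPU queries.
def slow_task():
    return sum(i * i for i in range(200_000))

def fast_task():
    return sum(i * i for i in range(20_000))

t_slow = time_repeated(slow_task)
t_fast = time_repeated(fast_task)
print('slow:', t_slow, '\nfast:', t_fast, '\nspeedup:', t_slow / t_fast)
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals; for runs lasting several seconds either works.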

## Install, Import, Load

In [5]:
# !pip install --extra-index-url=https://pypi.nvidia.com cuml-cu12 cudf-cu12
import cudf
cudf.__version__

!pip install -q igraph
!pip install -q graphistry

import pandas as pd
import graphistry, time, cProfile

from graphistry import (

    # graph operators
    n, e_undirected, e_forward, e_reverse,

    # attribute predicates
    is_in, ge, startswith, contains, match as match_re
)
graphistry.__version__

'0.33.0'

In [2]:
te_df = pd.read_csv('https://snap.stanford.edu/data/twitter_combined.txt.gz', sep=' ', names=['s', 'd'])
g = graphistry.edges(te_df, 's', 'd').materialize_nodes()
g._nodes.shape

(81306, 1)

## .chain() CPU v GPU

In [None]:
start = time.time()

for i in range(10):
  g2 = g.chain([n({'id': 17116707}), e_forward(hops=1)])
g2._nodes.shape, g2._edges.shape

end1 = time.time()
T1 = end1 - start

In [17]:
start = time.time()

# Convert nodes and edges to cuDF; this conversion is inside the timed region
g_gdf = g.nodes(lambda g: cudf.DataFrame(g._nodes)).edges(lambda g: cudf.DataFrame(g._edges))
for i in range(10):
  out = g_gdf.chain([n({'id': 17116707}), e_forward(hops=1)])._nodes
del g_gdf
del out

end2 = time.time()
T2 = end2 - start
print('CPU time:', T1, '\nGPU time:', T2, '\nspeedup:', T1 / T2)

CPU time: 17.837570190429688 
GPU time: 2.0647764205932617 
speedup: 8.638983868919091


## .hop() CPU v GPU

*   Simpler tasks can see a greater speedup: `hop()` reaches about 14x in this run, versus about 8.6x for `chain()` above
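
For intuition about what a multi-hop forward traversal computes, here is a pure-Python sketch of frontier expansion over an edge list. It is an illustration only, not graphistry's implementation: `forward_hop` and `toy_edges` are hypothetical names, and the real `hop()` returns a graph with nodes and edges and may differ in details such as whether the start nodes are included.

```python
def forward_hop(edges, start_nodes, hops):
    """Nodes reachable from start_nodes in at most `hops` forward steps."""
    # Build source -> destinations adjacency from (source, destination) pairs
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)

    visited = set(start_nodes)
    frontier = set(start_nodes)
    for _ in range(hops):
        # Follow every outgoing edge from the current frontier
        next_frontier = set()
        for node in frontier:
            next_frontier |= adjacency.get(node, set())
        # Keep only newly discovered nodes; stop early if there are none
        frontier = next_frontier - visited
        if not frontier:
            break
        visited |= frontier
    return visited

toy_edges = [(1, 2), (2, 3), (3, 4), (5, 1)]
print(sorted(forward_hop(toy_edges, [1], hops=1)))  # [1, 2]
print(sorted(forward_hop(toy_edges, [1], hops=8)))  # [1, 2, 3, 4]
```

Node 5 is never reached because its only edge points into node 1, and the traversal follows edges in the forward direction only.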



In [None]:
start = time.time()
start_nodes = pd.DataFrame({g._node: [17116707]})
for i in range(10):
  g2 = g.hop(
      nodes=start_nodes,
      direction='forward',
      hops=8)

end1 = time.time()
T1 = end1 - start

In [26]:
start = time.time()
start_nodes = cudf.DataFrame({g._node: [17116707]})
g_gdf = g.nodes(cudf.from_pandas(g._nodes)).edges(cudf.from_pandas(g._edges))
for i in range(10):
  g2 = g_gdf.hop(
      nodes=start_nodes,
      direction='forward',
      engine='cudf',  # run the traversal on the GPU via cuDF
      hops=8)
del start_nodes
del g_gdf
del g2

end2 = time.time()
T2 = end2 - start
print('CPU time:', T1, '\nGPU time:', T2, '\nspeedup:', T1 / T2)

CPU time: 40.91506862640381 
GPU time: 2.8351004123687744 
speedup: 14.431611821543413
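
Since each timed loop runs its query 10 times, the totals above can be turned into rough per-call figures. The sketch below only redoes that arithmetic on the numbers printed in this notebook; the `results` and `summary` dictionaries are illustrative names. The GPU totals also include the pandas-to-cuDF conversion done inside the timed cell, so the GPU per-call values slightly overstate the query time itself.

```python
REPEATS = 10  # each timed loop above runs the query 10 times

# Total seconds as printed by the benchmark cells in this notebook
results = {
    'chain': {'cpu': 17.837570190429688, 'gpu': 2.0647764205932617},
    'hop': {'cpu': 40.91506862640381, 'gpu': 2.8351004123687744},
}

summary = {}
for name, t in results.items():
    summary[name] = {
        'cpu_per_call': t['cpu'] / REPEATS,
        'gpu_per_call': t['gpu'] / REPEATS,
        'speedup': t['cpu'] / t['gpu'],
    }
    print(f"{name}: CPU {summary[name]['cpu_per_call']:.2f} s/call, "
          f"GPU {summary[name]['gpu_per_call']:.2f} s/call, "
          f"speedup {summary[name]['speedup']:.1f}x")
```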
