# Graph Pattern Mining with hop() and gfql()

This tutorial demonstrates how to use PyGraphistry's `hop()` and `gfql()` methods for graph pattern mining and traversal.

**Key concepts:**
- `g.hop()`: Filter by source node → edge → destination node patterns
- `g.gfql()`: Chain multiple node and edge filters for complex patterns
- Predicates: Use comparisons, string matching, and other filters
- Result labeling: Name intermediate results for analysis

We'll explore these concepts using a US Congress Twitter interaction dataset.

## 1. Install & configure

In [None]:
import os

import graphistry
from dotenv import load_dotenv

# Load environment variables from .env file if present
load_dotenv()

# Graphistry Configuration
# Manual configuration overrides (uncomment and modify as needed)
GRAPHISTRY_CONFIG = {
    # 'api': 3,
    # 'username': 'your_username',
    # 'password': 'your_password',
    # 'protocol': 'https',
    # 'server': 'hub.graphistry.com'
}

# Load configuration with hierarchical precedence: manual > env vars > .env file
config = {
    'api': GRAPHISTRY_CONFIG.get('api', int(os.getenv('GRAPHISTRY_API', '3'))),
    'username': GRAPHISTRY_CONFIG.get('username', os.getenv('GRAPHISTRY_USERNAME')),
    'password': GRAPHISTRY_CONFIG.get('password', os.getenv('GRAPHISTRY_PASSWORD')), 
    'protocol': GRAPHISTRY_CONFIG.get('protocol', os.getenv('GRAPHISTRY_PROTOCOL', 'https')),
    'server': GRAPHISTRY_CONFIG.get('server', os.getenv('GRAPHISTRY_SERVER', 'hub.graphistry.com'))
}

# Filter out None values and register
config = {k: v for k, v in config.items() if v is not None}
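# Register only when credentials are present; api, protocol, and server always have defaults
if config.get('username') and config.get('password'):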
    graphistry.register(**config)
    print("✅ Graphistry configured successfully")
    if config.get('server'):
        print(f"   Server: {config.get('server')}")
    if config.get('username'):
        print(f"   Username: {config.get('username')}")
else:
    print("⚠️  Graphistry not configured. Please set credentials in GRAPHISTRY_CONFIG or environment variables.")

# For more options: https://pygraphistry.readthedocs.io/en/latest/server/register.html
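
The precedence rule used in the cell above (manual value first, then environment variable, then a built-in default) can be isolated in a small helper. This is only a sketch: `resolve_setting` and the sample dictionaries are hypothetical names introduced here, not part of PyGraphistry or python-dotenv.

```python
def resolve_setting(key, manual, env, default=None):
    """Return the first value found: manual override, then environment, then default.

    Hypothetical helper for illustration; not part of PyGraphistry.
    """
    if key in manual:
        return manual[key]
    env_key = f"GRAPHISTRY_{key.upper()}"
    if env_key in env:
        return env[env_key]
    return default


# Stand-ins for GRAPHISTRY_CONFIG and os.environ (made-up values)
manual = {"server": "my.graphistry.example"}
env = {"GRAPHISTRY_SERVER": "hub.graphistry.com", "GRAPHISTRY_USERNAME": "alice"}

server = resolve_setting("server", manual, env, default="hub.graphistry.com")
username = resolve_setting("username", manual, env)
password = resolve_setting("password", manual, env)

print(server)    # the manual override wins over the environment
print(username)  # falls back to the environment
print(password)  # not set anywhere, so the default (None) is returned
```

Settings that resolve to `None` are the ones the configuration cell filters out before registering.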

In [None]:
import pandas as pd
import graphistry
from graphistry.compute.predicates import is_in, gt, lt, ge, le, eq, ne
from graphistry.compute.predicates import contains, startswith, endswith
from graphistry import match as match_re  # For regex matching
from graphistry.compute.ast import n, e_forward, e_reverse, e_undirected, e

## 2. Load & enrich a US congress twitter interaction dataset

In [None]:
# Load the US Congress Twitter interaction dataset
# This dataset contains Twitter interactions between members of the US Congress
edges_df = pd.read_csv('https://raw.githubusercontent.com/graphistry/pygraphistry/master/demos/data/twitter_congress_edges.csv')
print(f"Loaded {len(edges_df)} edges")
edges_df.head()

In [77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')

# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
      .materialize_nodes()
      .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
      .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
      .bind(point_title='title')
      .compute_igraph('community_infomap')
      .compute_igraph('pagerank')
      .get_degrees()
      .encode_point_color(
          'community_infomap',
          as_categorical=True,
          categorical_mapping={
              0: '#32a9a2', # vibrant teal
              1: '#ff6b6b', # soft coral
              2: '#f9d342', # muted yellow
          }
      )
)

g2._nodes



| | id | title | community_infomap | pagerank | degree_in | degree_out | degree |
|---|---|---|---|---|---|---|---|
| 0 | SenatorBaldwin | SenatorBaldwin | 0 | 0.001422 | 26 | 20 | 46 |
| 1 | SenJohnBarrasso | SenJohnBarrasso | 0 | 0.001179 | 22 | 19 | 41 |
| 2 | SenatorBennet | SenatorBennet | 0 | 0.001995 | 33 | 22 | 55 |
| 3 | MarshaBlackburn | MarshaBlackburn | 0 | 0.001331 | 18 | 38 | 56 |
| 4 | SenBlumenthal | SenBlumenthal | 0 | 0.001672 | 30 | 35 | 65 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 470 | RepJoeWilson | RepJoeWilson | 1 | 0.001780 | 21 | 38 | 59 |
| 471 | RobWittman | RobWittman | 1 | 0.001017 | 13 | 19 | 32 |
| 472 | rep_stevewomack | rep_stevewomack | 1 | 0.002637 | 35 | 19 | 54 |
| 473 | RepJohnYarmuth | RepJohnYarmuth | 2 | 0.000555 | 5 | 20 | 25 |


In [79]:
g2.plot()
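
The `degree_in`, `degree_out`, and `degree` columns in the node table above are consistent with the usual definitions, where `degree` is the sum of the other two. The sketch below recomputes them in plain Python on a toy edge list. The edge list and the `degrees` helper are made up for illustration and are not how `get_degrees()` is implemented.

```python
from collections import Counter


def degrees(edges):
    """Count in-, out-, and total degree for each node in a (src, dst) edge list."""
    degree_in = Counter(dst for _, dst in edges)
    degree_out = Counter(src for src, _ in edges)
    nodes = sorted(set(degree_in) | set(degree_out))
    return {
        node: {
            "degree_in": degree_in[node],
            "degree_out": degree_out[node],
            "degree": degree_in[node] + degree_out[node],
        }
        for node in nodes
    }


# Toy directed edges (hypothetical data)
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
toy_degrees = degrees(toy_edges)

for node, counts in toy_degrees.items():
    print(node, counts)
```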

## 3. Simple filtering: `g.hop()` & `g.gfql([...])`

We can filter by nodes, edges, and combinations of them.

The result is a graph whose node and edge tables we can inspect, and on which we can run further graph operations such as visualization or additional searches.

**Key concepts**

There are 2 key methods:
* `g.hop(...)`: filter triples of source node, edge, destination node
* `g.gfql([...])`: arbitrarily long sequence of node and edge predicates

They reuse column operations core to dataframe libraries, such as comparison operators on strings, numbers, and dates
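
As a mental model for the triple filter, the plain-Python sketch below keeps an edge only when its source node, the edge itself, and its destination node all satisfy their criteria. It illustrates the idea on toy data and is not PyGraphistry's implementation; `record_matches`, `filter_triples`, and the toy tables are names introduced here.

```python
def record_matches(record, criteria):
    """True when every key in criteria equals the record's value for that key."""
    return all(record.get(key) == value for key, value in criteria.items())


def filter_triples(nodes, edges, source_match=None, edge_match=None, destination_match=None):
    """Keep edges whose source node, edge attributes, and destination node all match."""
    source_match = source_match or {}
    edge_match = edge_match or {}
    destination_match = destination_match or {}
    return [
        edge
        for edge in edges
        if record_matches(nodes[edge["from"]], source_match)
        and record_matches(edge, edge_match)
        and record_matches(nodes[edge["to"]], destination_match)
    ]


# Toy node and edge tables (hypothetical data)
toy_nodes = {
    "a": {"community": 2},
    "b": {"community": 2},
    "c": {"community": 1},
}
toy_edges = [
    {"from": "a", "to": "b", "weight": 0.5},
    {"from": "a", "to": "c", "weight": 0.9},
    {"from": "c", "to": "b", "weight": 0.1},
]

same_community = filter_triples(
    toy_nodes,
    toy_edges,
    source_match={"community": 2},
    destination_match={"community": 2},
)
print(same_community)
```

Only the `a` to `b` edge survives, because it is the only edge whose two endpoints are both in community 2.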

**Sample tasks**

This section shows how to:

* Find SenSchumer and his immediate community (infomap metric)
* Look at his entire community
* Find everyone with high edge weight from/to SenSchumer; 2 hops either direction
* Find everyone in his community

In [None]:
g2.gfql([n({'title': 'SenSchumer'})])._nodes

Next, find SenSchumer's immediate connections within community 2:

In [None]:
g_immediate_community2 = g2.gfql([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])

print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)

Often, we are just filtering on a source node / edge / destination node triple, and `hop()` is shorthand for exactly that. All `hop()` parameters can also be passed to edge expressions.

In [83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})

print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)

214 senators 4993 relns

| | from | to | weight | weight2 |
|---|---|---|---|---|
| 378 | RepDonBeyer | RepSpeier | 0.000658 | 0.000658 |
| 354 | RepDonBeyer | repcleaver | 0.000658 | 0.000658 |
| 353 | RepDonBeyer | RepYvetteClarke | 0.000658 | 0.000658 |
| 352 | RepDonBeyer | RepCasten | 0.000658 | 0.000658 |
| 349 | RepDonBeyer | RepBeatty | 0.000658 | 0.000658 |
| 360 | RepDonBeyer | RepGaramendi | 0.000658 | 0.000658 |
| 361 | RepDonBeyer | RepChuyGarcia | 0.000658 | 0.000658 |
| 362 | RepDonBeyer | RepRaulGrijalva | 0.000658 | 0.000658 |
| 365 | RepDonBeyer | USRepKeating | 0.000658 | 0.000658 |
| 366 | RepDonBeyer | RepRickLarsen | 0.000658 | 0.000658 |

In [86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()

## 4. Multi-hop and paths-between-nodes pattern mining

The `gfql([...])` method can look more than one hop out, and can even find paths between nodes.

In [None]:
g_shumer_pelosi_bridges = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)

In [92]:
g_shumer_pelosi_bridges.plot()
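
The bridge query between SenSchumer and SpeakerPelosi asks for every node that sits one undirected hop from both endpoints. On a toy edge list, that amounts to intersecting the two neighbor sets, as sketched below in plain Python. The helper names and node names are invented for illustration; this is not PyGraphistry's implementation, and it ignores edge attributes entirely.

```python
from collections import defaultdict


def undirected_neighbors(edges):
    """Build an adjacency map that ignores edge direction."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency


def two_hop_bridges(edges, start, end):
    """Nodes adjacent to both start and end, ignoring edge direction."""
    adjacency = undirected_neighbors(edges)
    return adjacency[start] & adjacency[end]


# Toy directed edges (hypothetical data)
toy_edges = [
    ("start", "x"),
    ("y", "start"),
    ("x", "end"),
    ("end", "y"),
    ("start", "z"),  # z touches only the start node
    ("w", "end"),    # w touches only the end node
]

bridges = two_hop_bridges(toy_edges, "start", "end")
print(sorted(bridges))
```

Both `x` and `y` qualify even though their edges point in different directions, which is the effect of using an undirected edge step.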

## 5. Advanced filter predicates

We can use a variety of predicates for filtering nodes and edges beyond attribute value equality.

Common tasks include comparing attributes using:
* Set inclusion: `is_in([...])`
* Numeric comparisons: `gt(...)`, `lt(...)`, `ge(...)`, `le(...)`
* String comparison: `startswith(...)`, `endswith(...)`, `contains(...)`
* Regular expression matching: `match(...)` (imported above as `match_re`)
* Duplicate checking: `duplicated()`
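
One way to think about these predicates is as small functions that take a column value and return `True` or `False`. The sketch below builds toy stand-ins in plain Python to show that idea. All `toy_*` names, the `select` helper, and the sample rows are hypothetical and are not the PyGraphistry predicates themselves; the regex stand-in assumes matching is anchored at the start of the string, as with Python's `re.match`.

```python
import re


def toy_is_in(options):
    allowed = set(options)
    return lambda value: value in allowed


def toy_ge(threshold):
    return lambda value: value >= threshold


def toy_contains(fragment):
    return lambda value: fragment in value


def toy_startswith(prefix):
    return lambda value: value.startswith(prefix)


def toy_match(pattern):
    compiled = re.compile(pattern)
    # re.match only succeeds when the pattern matches at the start of the string
    return lambda value: compiled.match(value) is not None


def select(rows, column, predicate):
    """Return the titles of rows whose value in `column` satisfies the predicate."""
    return [row["title"] for row in rows if predicate(row[column])]


# Made-up rows; the titles and scores are not from the dataset
rows = [
    {"title": "SenExample", "pagerank": 0.006},
    {"title": "LeaderExample", "pagerank": 0.004},
    {"title": "RepExample", "pagerank": 0.001},
]

high_rank = select(rows, "pagerank", toy_ge(0.004))
leaders = select(rows, "title", toy_contains("Leader"))
sen_or_leader = select(rows, "title", toy_match(r"Sen|Leader"))
named = select(rows, "title", toy_is_in(["RepExample"]))

print(high_rank, leaders, sen_or_leader, named)
```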

Graph where nodes are in the top 20 pagerank:

In [134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr

0.005888600097034367
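
The cell above turns "top 20" into a numeric threshold by taking the 20th-largest pagerank value, which can then be used with `ge(...)`. The plain-Python sketch below shows the same idea on made-up scores with `k = 3`. Note that a `>=` filter on the threshold can return more than `k` items when values tie at the cutoff. `kth_largest` and the scores are hypothetical.

```python
def kth_largest(values, k):
    """Return the k-th largest value (k=1 is the maximum)."""
    return sorted(values, reverse=True)[k - 1]


# Made-up scores with a tie at the cutoff
scores = [0.9, 0.1, 0.5, 0.5, 0.3, 0.7]

threshold = kth_largest(scores, 3)
selected = [score for score in scores if score >= threshold]

print(threshold)
print(selected)  # four values pass, because two scores tie at the threshold
```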

In [None]:
g_high_pr = g2.gfql([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])

len(g_high_pr._nodes)

Graph where the name includes `Leader`:

In [136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match = {'title': contains('Leader')}
)

print(len(g_leaders._nodes), 'leaders')

g_leaders.plot()

2 leaders


Graph of leaders and senators, using a regular expression on the title:

In [139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match = {'title': match_re(r'Sen|Leader')}
)

print(len(g_leaders_and_senators._nodes), 'leaders and senators')

g_leaders_and_senators.plot()

67 leaders and senators


## 6. Result labeling

It can be useful to name nodes and edges within the path query for downstream reasoning. Each name becomes a boolean column on the resulting node or edge table:

In [None]:
g_bridges2 = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_bridges2._nodes), 'senators in full graph')

named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')

g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()
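
Judging from how the cell above filters on `found_bridge`, `from_schumer`, and `from_pelosi`, a named step behaves like a boolean column that marks which rows matched at that step. The plain-Python sketch below mimics that behavior on toy records. `label_matches` and the node ids are hypothetical and do not reflect how PyGraphistry computes the labels.

```python
def label_matches(records, matched_ids, name, id_key="id"):
    """Return copies of records with a boolean column `name` marking matched ids."""
    return [{**record, name: record[id_key] in matched_ids} for record in records]


# Toy result nodes (hypothetical data)
toy_nodes = [{"id": "start"}, {"id": "bridge_1"}, {"id": "bridge_2"}, {"id": "end"}]

labeled = label_matches(toy_nodes, {"bridge_1", "bridge_2"}, "found_bridge")
bridges = [node["id"] for node in labeled if node["found_bridge"]]

print(bridges)
```

Filtering on the flag recovers just the intermediate nodes, which is what the `named` selection does in the cell above.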