# Graph Pattern Mining with hop() and gfql()

This tutorial demonstrates how to use PyGraphistry's `hop()` and `gfql()` methods for graph pattern mining and traversal.

**Key concepts:**
- `g.hop()`: Filter by source node → edge → destination node patterns
- `g.gfql()`: Chain multiple node and edge filters for complex patterns
- Predicates: Use comparisons, string matching, and other filters
- Result labeling: Name intermediate results for analysis

We'll explore these concepts using a US Congress Twitter interaction dataset.

## 1. Install & configure

In [None]:
import pandas as pd
import graphistry
from graphistry import (
    # Graph operators
    n, e, e_forward, e_reverse, e_undirected,
    # Attribute predicates
    is_in, gt, lt, ge, le, eq, ne,
    contains, startswith, endswith,
    match as match_re,  # Regex matching
)

In [None]:
graphistry.register(api=3, username='...', password='...')

## 2. Load & enrich a US Congress Twitter interaction dataset

This notebook uses an aggregated version of the Twitter-Congress dataset by Drew Conway:
https://github.com/drewconway/Twitter-Congress

We collapse multiedges into a single weighted edge and store the result in
`demos/data/twitter_congress_edges_weighted.csv.gz` for reproducible docs builds.
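
As a rough illustration of what collapsing multiedges means, the sketch below merges repeated (source, target) pairs into one edge whose weight is the repeat count. The toy names and the count-based weight are assumptions for illustration only; the weights in the published file are fractional, so it likely applies a normalization that is not described here.

```python
from collections import Counter

# Hypothetical raw interactions: one (source, target) pair per event
raw_edges = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "carol"),
    ("alice", "bob"),
    ("carol", "alice"),
]


def collapse_multiedges(pairs):
    """Merge repeated (source, target) pairs into one edge with a count weight."""
    counts = Counter(pairs)
    return [
        {"from": src, "to": dst, "weight": count}
        for (src, dst), count in sorted(counts.items())
    ]


weighted_edges = collapse_multiedges(raw_edges)
for edge in weighted_edges:
    print(edge)
```

Direction is preserved here: `("alice", "bob")` and `("bob", "alice")` would stay separate edges.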


In [None]:
# Load the US Congress Twitter interaction dataset
# This dataset contains Twitter interactions between members of the US Congress
edges_df = pd.read_csv('../../data/twitter_congress_edges_weighted.csv.gz')
print(f"Loaded {len(edges_df)} edges")
edges_df.head()


In [77]:
# Shape
g = graphistry.edges(edges_df, 'from', 'to')

# Enrich & style
# Tip: Switch from compute_igraph to compute_cugraph when GPUs are available
g2 = (g
      .materialize_nodes()
      .nodes(lambda g: g._nodes.assign(title=g._nodes.id))
      .edges(lambda g: g._edges.assign(weight2=g._edges.weight))
      .bind(point_title='title')
      .compute_igraph('community_infomap')
      .compute_igraph('pagerank')
      .get_degrees()
      .encode_point_color(
          'community_infomap',
          as_categorical=True,
          categorical_mapping={
              0: '#32a9a2', # vibrant teal
              1: '#ff6b6b', # soft coral
              2: '#f9d342', # muted yellow
          }
      )
)

g2._nodes



| | id | title | community_infomap | pagerank | degree_in | degree_out | degree |
|---|---|---|---|---|---|---|---|
| 0 | SenatorBaldwin | SenatorBaldwin | 0 | 0.001422 | 26 | 20 | 46 |
| 1 | SenJohnBarrasso | SenJohnBarrasso | 0 | 0.001179 | 22 | 19 | 41 |
| 2 | SenatorBennet | SenatorBennet | 0 | 0.001995 | 33 | 22 | 55 |
| 3 | MarshaBlackburn | MarshaBlackburn | 0 | 0.001331 | 18 | 38 | 56 |
| 4 | SenBlumenthal | SenBlumenthal | 0 | 0.001672 | 30 | 35 | 65 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 470 | RepJoeWilson | RepJoeWilson | 1 | 0.001780 | 21 | 38 | 59 |
| 471 | RobWittman | RobWittman | 1 | 0.001017 | 13 | 19 | 32 |
| 472 | rep_stevewomack | rep_stevewomack | 1 | 0.002637 | 35 | 19 | 54 |
| 473 | RepJohnYarmuth | RepJohnYarmuth | 2 | 0.000555 | 5 | 20 | 25 |


In [79]:
g2.plot()

## 3. Simple filtering: `g.hop()` & `g.gfql([...])`

We can filter by nodes, edges, and combinations of them.

The result is itself a graph, so we can inspect its node and edge tables or run further graph operations on it, such as visualization or additional searches.

**Key concepts**

There are 2 key methods:
* `g.hop(...)`: filter triples of source node, edge, destination node
* `g.gfql([...])`: arbitrarily long sequence of node and edge predicates

They reuse column operations core to dataframe libraries, such as comparison operators on strings, numbers, and dates.
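
As a mental model for the triple filter, here is a minimal library-free sketch: keep an edge only when its source node and its destination node both satisfy their attribute filters, then keep the nodes those edges touch. The helper names `passes` and `hop_triples` and the toy data are hypothetical. This is not how PyGraphistry implements `hop()`, which has additional behavior that the sketch ignores.

```python
def passes(record, conditions):
    """True when every key in `conditions` equals the record's value."""
    return all(record.get(key) == value for key, value in (conditions or {}).items())


def hop_triples(nodes, edges, source_node_match=None, destination_node_match=None):
    """Keep edges whose source and destination nodes both pass their filters."""
    by_id = {node["id"]: node for node in nodes}
    kept_edges = [
        edge for edge in edges
        if passes(by_id[edge["from"]], source_node_match)
        and passes(by_id[edge["to"]], destination_node_match)
    ]
    # Only nodes touched by a surviving edge remain in the result
    kept_ids = {edge["from"] for edge in kept_edges} | {edge["to"] for edge in kept_edges}
    kept_nodes = [node for node in nodes if node["id"] in kept_ids]
    return kept_nodes, kept_edges


nodes = [
    {"id": "a", "community": 2},
    {"id": "b", "community": 2},
    {"id": "c", "community": 0},
]
edges = [
    {"from": "a", "to": "b"},
    {"from": "b", "to": "c"},
    {"from": "c", "to": "a"},
]

kept_nodes, kept_edges = hop_triples(nodes, edges, {"community": 2}, {"community": 2})
print([node["id"] for node in kept_nodes], kept_edges)
```

Only the `a` to `b` edge survives, because every other edge has an endpoint outside community 2.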

**Sample tasks**

This section shows how to:

* Find SenSchumer and his immediate community (infomap metric)
* Look at his entire community
* Find everyone with high edge weight from/to SenSchumer; 2 hops either direction
* Find everyone in his community

In [None]:
g2.gfql([n({'title': 'SenSchumer'})])._nodes

First, let's find the immediate connections to SenSchumer:

In [None]:
g_immediate_community2 = g2.gfql([n({'title': 'SenSchumer'}), e_undirected(), n({'community_infomap': 2})])

print(len(g_immediate_community2._nodes), 'senators', len(g_immediate_community2._edges), 'relns')
g_immediate_community2._edges[['from', 'to', 'weight2']].sort_values(by=['weight2']).head(10)

Often, we are just filtering on a source node / edge / destination node triple, so `hop()` is a shorthand for this case. All `hop()` parameters can also be passed to edge expressions.

In [83]:
g_community2 = g2.hop(source_node_match={'community_infomap': 2}, destination_node_match={'community_infomap': 2})

print(len(g_community2._nodes), 'senators', len(g_community2._edges), 'relns')
g_community2._edges.sort_values(by=['weight2']).head(10)

214 senators 4993 relns

| | from | to | weight | weight2 |
|---|---|---|---|---|
| 378 | RepDonBeyer | RepSpeier | 0.000658 | 0.000658 |
| 354 | RepDonBeyer | repcleaver | 0.000658 | 0.000658 |
| 353 | RepDonBeyer | RepYvetteClarke | 0.000658 | 0.000658 |
| 352 | RepDonBeyer | RepCasten | 0.000658 | 0.000658 |
| 349 | RepDonBeyer | RepBeatty | 0.000658 | 0.000658 |
| 360 | RepDonBeyer | RepGaramendi | 0.000658 | 0.000658 |
| 361 | RepDonBeyer | RepChuyGarcia | 0.000658 | 0.000658 |
| 362 | RepDonBeyer | RepRaulGrijalva | 0.000658 | 0.000658 |
| 365 | RepDonBeyer | USRepKeating | 0.000658 | 0.000658 |
| 366 | RepDonBeyer | RepRickLarsen | 0.000658 | 0.000658 |

In [86]:
g_community2.encode_point_color('pagerank', ['blue', 'yellow', 'red'], as_continuous=True).plot()

## 4. Multi-hop and paths-between-nodes pattern mining

The `gfql([...])` method can look more than one hop out, and can even find paths between nodes.

In [None]:
g_shumer_pelosi_bridges = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(),
    n(),
    e_undirected(),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_shumer_pelosi_bridges._nodes), 'senators')
g_shumer_pelosi_bridges._edges.sort_values(by='weight').head(5)

In [92]:
g_shumer_pelosi_bridges.plot()
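
To make the Schumer-to-Pelosi query more concrete: it is a two-hop undirected path pattern, so the middle step matches nodes adjacent to both endpoints. The sketch below computes that set on a toy edge list with plain Python. The function name `two_hop_bridges` and the toy handles are hypothetical, and this is an illustration of the idea rather than GFQL's evaluation strategy.

```python
from collections import defaultdict


def two_hop_bridges(edges, start, end):
    """Return nodes adjacent to both `start` and `end`, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    # A bridge sits in the middle of a start - bridge - end path
    return sorted((neighbors[start] & neighbors[end]) - {start, end})


toy_edges = [
    ("schumer", "x"), ("x", "pelosi"),   # x bridges the two
    ("y", "schumer"), ("pelosi", "y"),   # y also bridges, direction ignored
    ("schumer", "z"),                    # z only touches one endpoint
    ("schumer", "pelosi"),               # a direct edge is not a two-hop bridge
]

bridges = two_hop_bridges(toy_edges, "schumer", "pelosi")
print(bridges)  # ['x', 'y']
```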

## 5. Advanced filter predicates

We can use a variety of predicates for filtering nodes and edges beyond attribute value equality.

Common tasks include comparing attributes using:
* Set inclusion: `is_in([...])`
* Numeric comparisons: `gt(...)`, `lt(...)`, `ge(...)`, `le(...)`
* String comparison: `startswith(...)`, `endswith(...)`, `contains(...)`
* Regular expression matching: `match(...)`
* Duplicate checking: `duplicated()`

Graph where nodes are in the top 20 pagerank:

In [134]:
top_20_pr = g2._nodes.pagerank.sort_values(ascending=False, ignore_index=True)[19]
top_20_pr

0.005888600097034367
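
The value above is the 20th-largest PageRank score, used as a cutoff. One consequence worth seeing on toy numbers: filtering with "greater than or equal to the k-th largest value" keeps at least k items, and more when scores tie at the cutoff. The helper `kth_largest` and the scores are made up for illustration.

```python
def kth_largest(values, k):
    """Return the k-th largest value (k=1 is the maximum)."""
    return sorted(values, reverse=True)[k - 1]


# Hypothetical scores; "b" and "c" tie at the cutoff
scores = {"a": 0.9, "b": 0.5, "c": 0.5, "d": 0.1, "e": 0.7}

threshold = kth_largest(scores.values(), 3)
selected = sorted(name for name, score in scores.items() if score >= threshold)

print(threshold, selected)
```

Here the top-3 cutoff is 0.5, and four names pass it because of the tie.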

In [None]:
g_high_pr = g2.gfql([
    n({'pagerank': ge(top_20_pr)}),
    e_undirected(),
    n({'pagerank': ge(top_20_pr)}),
])

len(g_high_pr._nodes)

Graph where the name includes `Leader`:

In [136]:
g_leaders = g2.hop(
    source_node_match={'title': contains('Leader')},
    destination_node_match={'title': contains('Leader')}
)

print(len(g_leaders._nodes), 'leaders')

g_leaders.plot()

2 leaders


Graph of leaders and senators:

In [139]:
g_leaders_and_senators = g2.hop(
    source_node_match={'title': match_re(r'Sen|Leader')},
    destination_node_match={'title': match_re(r'Sen|Leader')}
)

print(len(g_leaders_and_senators._nodes), 'leaders and senators')

g_leaders_and_senators.plot()

67 leaders and senators
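
One subtlety when choosing between a regex match and `contains`: if the regex predicate follows pandas `str.match` semantics (an assumption here), the pattern is anchored at the start of the string, whereas a substring-style check can hit anywhere. The sketch below shows the difference with Python's `re` module on a few handles, one of which (`LeaderExample`) is made up.

```python
import re

# Toy titles; "LeaderExample" is a hypothetical handle
titles = ["SenSchumer", "GOPLeader", "LeaderExample", "RepSpeier"]
pattern = r"Sen|Leader"

# re.match only succeeds when the pattern matches at the start of the string
anchored = [title for title in titles if re.match(pattern, title)]

# re.search succeeds when the pattern matches anywhere in the string
anywhere = [title for title in titles if re.search(pattern, title)]

print(anchored)
print(anywhere)
```

`GOPLeader` appears only in the second list, since `Leader` is not at the start of the string. If the anchored behavior is not what you want, a leading `.*` in the pattern is one way to relax it.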


## 6. Result labeling

It can be useful to name nodes and edges within the path query for downstream reasoning:

In [6]:
g_bridges2 = g2.gfql([
    n({'title': 'SenSchumer'}),
    e_undirected(name='from_schumer'),
    n(name='found_bridge'),
    e_undirected(name='from_pelosi'),
    n({'title': 'SpeakerPelosi'})
])

print(len(g_bridges2._nodes), 'senators in full graph')

named = g_bridges2._nodes[ g_bridges2._nodes.found_bridge ]
print(len(named), 'bridging senators')
edges = g_bridges2._edges
print(len(edges[edges.from_schumer]), 'relns from_schumer', len(edges[edges.from_pelosi]), 'relns from_pelosi')

g_bridges2.encode_point_color(
    'found_bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'orange',
        False: 'silver'
    }
).plot()

25 senators in full graph
23 bridging senators
23 relns from_schumer 32 relns from_pelosi


## 7. Pattern Reuse with Let Bindings

The `let` operator allows you to define named graph patterns that can be referenced multiple times in your query. This is particularly useful for:
- Creating reusable pattern components
- Building complex patterns from simpler building blocks
- Avoiding repetition in pattern definitions

Let's explore how to use `let` bindings for finding triangles and other complex patterns.

In [7]:
# Finding triangles using let bindings
# PageRank threshold for high-influence nodes (top 30%)
top_30_pr = g2._nodes.pagerank.quantile(0.7)

# Find triangles of high-influence members
g_triangles = g2.gfql([
    {
        'let': {
            # Define a pattern for high-influence nodes
            'influential': n({'pagerank': ge(top_30_pr)}),
            # Define a pattern for strong connections
            'strong_edge': e_undirected({'weight': ge(0.01)})
        }
    },
    # Use the defined patterns to find triangles
    {'pattern': 'influential', 'name': 'node_a'},
    {'pattern': 'strong_edge'},
    {'pattern': 'influential', 'name': 'node_b'},
    {'pattern': 'strong_edge'},
    {'pattern': 'influential', 'name': 'node_c'},
    {'pattern': 'strong_edge'},
    {'pattern': 'influential', 'name': 'node_a'}  # Close the triangle
])

print(f"Found {len(g_triangles._nodes)} nodes in triangles")
print(f"Found {len(g_triangles._edges)} edges in triangles")

# Visualize the triangles
g_triangles.encode_point_color('community_infomap', as_categorical=True).plot()

Found 108 nodes in triangles
Found 2772 edges in triangles
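
For small graphs, a brute-force check can help build intuition for what a triangle of strong connections is. The sketch below lists node triples in which all three undirected edges meet a minimum weight. The function name `strong_triangles` and the toy edges are hypothetical, and this is an independent illustration rather than how the query above is evaluated.

```python
from itertools import combinations


def strong_triangles(weighted_edges, min_weight):
    """List node triples whose three undirected edges all meet `min_weight`."""
    strong = set()
    for src, dst, weight in weighted_edges:
        if weight >= min_weight:
            strong.add(frozenset((src, dst)))  # frozenset ignores direction
    nodes = sorted({node for pair in strong for node in pair})
    return [
        trio for trio in combinations(nodes, 3)
        if all(frozenset(pair) in strong for pair in combinations(trio, 2))
    ]


toy = [
    ("a", "b", 0.02),
    ("b", "c", 0.03),
    ("c", "a", 0.01),
    ("c", "d", 0.05),
    ("d", "a", 0.001),  # too weak, so a-c-d is not a strong triangle
]

triangles = strong_triangles(toy, 0.01)
print(triangles)
```

This enumerates every triple, so it is only practical for small node sets.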


### Finding Community Bridge Patterns with Let

Let's use `let` to define reusable patterns for finding members who bridge different communities:

In [8]:
# Find members who bridge communities using let bindings
g_community_bridges = g2.gfql([
    {
        'let': {
            # Pattern for community 0 members
            'community_0': n({'community_infomap': 0}),
            # Pattern for community 1 members  
            'community_1': n({'community_infomap': 1}),
            # Pattern for community 2 members
            'community_2': n({'community_infomap': 2}),
            # Pattern for any edge
            'any_edge': e_undirected()
        }
    },
    # Find paths from community 0 to community 1 through community 2
    {'pattern': 'community_0', 'name': 'start'},
    {'pattern': 'any_edge'},
    {'pattern': 'community_2', 'name': 'bridge'},
    {'pattern': 'any_edge'},
    {'pattern': 'community_1', 'name': 'end'}
])

print(f"Found {len(g_community_bridges._nodes)} nodes in bridging pattern")
bridges = g_community_bridges._nodes[g_community_bridges._nodes.bridge]
print(f"Community 2 members acting as bridges: {list(bridges.title.values)}")

# Visualize with bridge nodes highlighted
g_community_bridges.encode_point_color(
    'bridge',
    as_categorical=True,
    categorical_mapping={
        True: 'red',
        False: 'lightgray'
    }
).encode_point_size('bridge', categorical_mapping={True: 80, False: 40}).plot()

Found 7 nodes in bridging pattern
Community 2 members acting as bridges: ['RepBoswell']


### Complex Pattern Composition with Let

Let's create more sophisticated patterns by composing smaller patterns:

In [9]:
# Find star patterns around influential nodes
# A star pattern is where one central node connects to multiple others

g_star_patterns = g2.gfql([
    {
        'let': {
            # Very influential nodes (top 10%)
            'very_influential': n({'pagerank': ge(g2._nodes.pagerank.quantile(0.9))}),
            # Moderately influential nodes (top 50%)
            'moderately_influential': n({'pagerank': ge(g2._nodes.pagerank.quantile(0.5))}),
            # Strong connection, matched in either direction
            'strong_connection': e_undirected({'weight': ge(0.02)})
        }
    },
    # Find star patterns: very influential center connected to multiple moderately influential nodes
    {'pattern': 'very_influential', 'name': 'center'},
    {'pattern': 'strong_connection'},
    {'pattern': 'moderately_influential', 'name': 'spoke1'},
    # Return to center
    e_undirected(),
    {'pattern': 'very_influential', 'name': 'center'},
    {'pattern': 'strong_connection'},
    {'pattern': 'moderately_influential', 'name': 'spoke2'},
    # Return to center again
    e_undirected(),
    {'pattern': 'very_influential', 'name': 'center'},
    {'pattern': 'strong_connection'},
    {'pattern': 'moderately_influential', 'name': 'spoke3'}
])

print(f"Found {len(g_star_patterns._nodes)} nodes in star patterns")
centers = g_star_patterns._nodes[g_star_patterns._nodes.center]
print(f"Central nodes: {list(centers.title.unique())[:5]}...")  # Show first 5

# Visualize with centers highlighted
g_star_patterns.encode_point_color(
    'center',
    as_categorical=True,
    categorical_mapping={
        True: 'gold',
        False: 'lightblue'
    }
).encode_point_size(
    'center',
    categorical_mapping={True: 100, False: 50}
).plot()

Found 177 nodes in star patterns
Central nodes: ['GOPLeader', 'RepBachmann', 'RepBlackburn', 'RepBoehner', 'RepChaffetz']...


### Benefits of Let Bindings

The `let` operator provides several advantages:

1. **Reusability**: Define a pattern once and use it multiple times
2. **Readability**: Give meaningful names to complex patterns
3. **Maintainability**: Change pattern definitions in one place
4. **Composability**: Build complex patterns from simpler components

This makes it easier to explore and mine complex graph patterns in your data!