# Hello PyGraphistry[ai] - HackerNews visual semantic search with UMAP & BERT

`PyGraphistry[ai]` can quickly create visual graph search interfaces for structured text. It automates much of the work in cleaning, connecting, encoding, searching, and visualizing graph data. The result is a shorter *time to graph* and better overall results, in as little as one line of code.

This notebook shows how to turn 3,000 HackerNews articles into an interactive visual graph with full semantic search. The core flow takes only a few lines of code and trains in about 2 minutes on a CPU, and 100-200x faster on a GPU. The notebook demonstrates a fast, automatic feature engineering pipeline that exposes feature matrices and targets; a scikit-learn-like API; full semantic search over the data, which returns dataframes or subgraphs for a query; and `GNN` models and pipelines.

Outline:

* Load the data into a graphistry instance, `g = graphistry.nodes(dataframe)`
* Since we do not have explicit edges, create a similarity graph using UMAP, `g.umap(..)`,
    which calls the `g.featurize(...)` API to create features and then runs UMAP on them, adding an implicit edge dataframe that you can access with `g._edges` (`g._nodes` holds the original dataframe)
* Once the models are built, search the data and display subgraphs from the search query itself
    using `g.search(query)` and `g.search_graph(query).plot()`
* Transform new data using `g.transform(..)`, useful for online or API-driven endpoints after a data model has been set
* Lastly, create a DGL GNN data model with `g.build_gnn(...)`, which may be used for downstream `GNN` modeling
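
To make the idea of an implicit similarity graph concrete, here is a minimal pure-Python sketch that links each encoded row to its nearest neighbors by cosine similarity. It is an illustration only: UMAP builds a fuzzy neighbor graph with its own weighting, and the helper names `cosine` and `knn_edges` and the toy vectors are assumptions made for this example, not part of the PyGraphistry API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_edges(vectors, k=2):
    """Return (src, dst, weight) edges linking each row to its k most similar rows."""
    edges = []
    for i, v in enumerate(vectors):
        scored = [(cosine(v, w), j) for j, w in enumerate(vectors) if j != i]
        scored.sort(reverse=True)  # most similar first
        edges.extend((i, j, sim) for sim, j in scored[:k])
    return edges

# Toy "encoded posts" (hypothetical): rows 0 and 1 point the same way, as do rows 2 and 3.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_knn = knn_edges(toy_vectors, k=1)
print(toy_knn)
```

Each row ends up connected to the row that encodes the most similar content, which is the same role the implicit edges in `g._edges` play after `g.umap(..)`.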

Searching over data is useful for finding and refining subgraphs within the global corpus of documents, events, or other data. Search can be operationalized over log data (see the morpheus demo), eCommerce (see the clickstream and user-item-recommendation demos), stock and coin data (see the crypto-slim demo), OSINT data, etc.

`GNN`s built over these feature encodings are useful for downstream modeling like link prediction, node classification, motif mining and other popular graph AI pipelines. 

In [None]:
# depends on where you have your data/ folder
#mkdir data

In [None]:
#! pip install --upgrade graphistry[ai]   # get the latest graphistry AI 

In [None]:
# cd .. 

In [None]:
import os
from collections import Counter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import graphistry

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_colwidth', 200)

In [None]:
alpha = 1/137
np.random.seed(int(alpha**-1))  

In [None]:
# add your hub credentials here
graphistry.register(api=3, protocol="https", server="hub.graphistry.com", username = os.environ['USERNAME'], password=os.environ['GRAPHISTRY_PASSWORD'])

In [None]:
# get the data top 3000 posts on Hacker News
df = pd.read_csv('https://storage.googleapis.com/cohere-assets/blog/text-clustering/data/askhn3k_df.csv', index_col=0)

In [None]:
good_cols = ['title', 'text']

In [None]:
df.columns

In [None]:
df.head()  # see the dataset

In [None]:
df[good_cols].head()

# Featurize and Encode the Data

In [None]:
from time import time
t0 = time()
################################################################
## Two lines of code cut through 80% of the data science work

df = df.sample(3000) # set smaller if you want to test a minibatch 

################################################################
# create the graphistry instance
g = graphistry.nodes(df)

# set to False if you want to reload last trained instance
process = True

if process:
    # Umap will create a similarity graph from the features which we can view as a graph
    g2 = g.umap(X=['title', 'text'], # the features to encode (can add/remove 'text', etc)
                y=['score'], # for demonstrative purposes, we include a target -- though this one is not really conditioned on textual features in a straightforward way
                model_name='msmarco-distilbert-base-v2', #'paraphrase-MiniLM-L6-v2', etc, from sbert/Huggingface, the text encoding model
                min_words = 0, # when 0 forces all X=[..] as textually encoded, higher values would ascertain if a column is textual or not depending on average number of words per column
                use_ngrams=False, # set to True if you want ngram features instead (does not make great plots but useful for other situations)
                use_scaler_target='zscale', # for regressive targets
                use_scaler=None, # there are many more settings see `g.featurize?` and `g.umap?` for further options
               )
    g2.save_search_instance('data/hn.search')
    print('-'*80)
    print(f'Encoding {df.shape[0]} records using {str(g2._node_encoder.text_model)[:19]} took {(time()-t0)/60:.2f} minutes')
else:
    # or load the search instance
    g2 = g.load_search_instance('data/hn.search')
    print('-'*80)
    print('Loaded saved instance')
    
################################################################


In [None]:
# see all the data
g2.plot()

In [None]:
# get the encoded features, and use in downstream models (clf.fit(x, y), etc)
x=g2._get_feature('nodes')
x

In [None]:
# likewise with the (scaled) targets
y = g2._get_target('nodes')
y

In [None]:
# visualize the results where we prune edges using the `filter_weighted_edges` method
# this keeps all edges with weight 0.5 and above (the more similar pairs). The initial layout is the same (given by umap in 2d)
g25 = g2.filter_weighted_edges(0.5)
g25.plot(render=True)
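
As a rough picture of what pruning by edge weight does, the sketch below keeps only the edges whose weight meets a threshold. It is a plain-Python illustration under stated assumptions: the function name `filter_edges_by_weight`, the dict-based edge format, and the inclusive `>=` comparison are all hypothetical and are not taken from the `filter_weighted_edges` implementation.

```python
def filter_edges_by_weight(edges, min_weight=0.5):
    """Keep edges whose similarity weight is at least min_weight (assumed inclusive)."""
    return [e for e in edges if e["weight"] >= min_weight]

# Hypothetical weighted edges, as a similarity graph might produce them.
weighted_edges = [
    {"src": "a", "dst": "b", "weight": 0.91},
    {"src": "a", "dst": "c", "weight": 0.32},
    {"src": "b", "dst": "c", "weight": 0.50},
]

kept = filter_edges_by_weight(weighted_edges, min_weight=0.5)
print([(e["src"], e["dst"]) for e in kept])
```

Raising the threshold leaves fewer, stronger edges, which usually makes the plotted clusters easier to read.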

# Let's query the graph

In [None]:
# direct keyword search when fuzzy=False and a set of columns are given, does not require featurization
g.search('love', fuzzy=False, cols=['title'])[0][['title']]

In [None]:
# Query semantically instead of strict keyword matching

sample_queries = ['Is true love possible?', 
                  'How to create deep learning models?', 
                  'Best tech careers',
                  'How do I make more money?', 
                  'Advances in particle physics', 
                  'Best apps and gadgets', 
                  'Graph Neural Networks', 
                  'recommend impactful books', 
                  'lamenting about life']

for query in sample_queries:
    print('*'*33)
    print(query)
    print('*'*30)
    # use the featurized instance g2 for semantic search
    results_df, encoded_query_vector = g2.search(query)
    print(results_df['title'])
    print('-'*60)
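
Conceptually, semantic search encodes the query into the same vector space as the documents and returns the closest ones. The sketch below shows that ranking step with hand-made 2-d vectors. It is an assumption-laden illustration: the distance metric, the function name `rank_by_distance`, and the `top_n` and `thresh` parameters of this helper are hypothetical choices for the example and may not match how `g2.search` ranks results internally.

```python
import math

def rank_by_distance(query_vec, doc_vecs, top_n=2, thresh=float("inf")):
    """Return the ids of the top_n documents closest to the query, within thresh."""
    scored = []
    for doc_id, vec in doc_vecs.items():
        dist = math.dist(query_vec, vec)  # Euclidean distance (assumed metric)
        if dist <= thresh:
            scored.append((dist, doc_id))
    scored.sort()  # closest first
    return [doc_id for _, doc_id in scored[:top_n]]

# Hypothetical titles with made-up embeddings; a real encoder produces hundreds of dimensions.
toy_docs = {
    "Ask HN: Best books on deep learning?": [0.9, 0.1],
    "Ask HN: How do I negotiate salary?": [0.1, 0.9],
    "Ask HN: Resources for training neural nets?": [0.8, 0.3],
}
toy_query = [1.0, 0.0]  # stands in for an encoded "How to create deep learning models?"

hits = rank_by_distance(toy_query, toy_docs, top_n=2)
print(hits)
```

The two deep-learning titles rank first even though neither shares a keyword with the query, which is the behavior that distinguishes semantic search from the `fuzzy=False` keyword match.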

# Search to Graph

We may also query and create a graph of results. `g.search_graph` takes the nodes found by `g.search` and pulls in the edges whose `src` AND `dst` are both among those nodes or, with `broader=True`, the edges with either `src` OR `dst` among them. The latter can be useful in user-relationship-item, user, and behavioral datasets and in recommendation strategies, where NLP search can help recall or create ontologically similar mini-batches to broaden scope.
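
The AND versus OR rule can be sketched in a few lines of plain Python. This is a simplified illustration of the selection logic described above, not the library's code: the helper name `induced_edges` and the toy edge list are assumptions, and only the `broader` flag mirrors the real parameter name.

```python
def induced_edges(edges, found_nodes, broader=False):
    """Select edges touching the search hits.

    broader=False keeps edges with src AND dst among the hits;
    broader=True keeps edges with src OR dst among the hits.
    """
    found = set(found_nodes)
    if broader:
        return [(s, d) for s, d in edges if s in found or d in found]
    return [(s, d) for s, d in edges if s in found and d in found]

# Hypothetical graph: a chain a-b-c-d-e, with a, b, c returned by the search.
all_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
search_hits = ["a", "b", "c"]

strict = induced_edges(all_edges, search_hits)
broad = induced_edges(all_edges, search_hits, broader=True)
print(strict)
print(broad)
```

The strict result stays inside the set of hits, while the broader result also reaches one hop out to node `d`, which is what makes it useful for pulling in related items that the query itself did not match.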

In [None]:
gr = g2.search_graph('How to create deep learning models', thresh=15, top_n=50, scale=0.25, broader=False) 
gr.plot()

In [None]:
g2.search_graph('Graph Neural Networks', thresh=50, top_n=50, scale=0.1, broader=False).plot()

In [None]:
g2.search_graph('fraud detection algorithms', thresh=50, top_n=50, scale=0.1, broader=False).plot()  # works better if you encode 'text' column as well

# Transforming new or unseen data (imagine a train/test split or a new mini-batch)

In [None]:
x, y = g2.transform(df.sample(10), df.sample(10), kind='nodes')  # or kind='edges' if edges were given or already produced by umap-ing the nodes;
                                                                # if neither, set `embedding=True` for a random embedding of size `n_topics`
x

Likewise, we can use `transform_umap` to get the embedding coordinates as well:

In [None]:
emb, x, y = g2.transform_umap(df.sample(10), df.sample(10))
emb

# Build a GNN model 

In [None]:
# this inherits all the arguments from the g.featurize API for both nodes and edges, see g.build_gnn? for details
g3 = g25.build_gnn()  # we use the filtered-edges graphistry instance as it has higher-fidelity similarity scores on edges
                        # i.e., fewer edges

In [None]:
# notice the difference in edge dataframes between g25 and g3
g25._edges

In [None]:
# versus
g3._edges

In [None]:
# Edges come from data supplied by umap on nodes
g3._edge_encoder.feature_names_in

In [None]:
g3._edge_features.head()

In [None]:
# Since edges are featurized, we can transform on "unseen/batch" ones
# y_edges will be None since we don't have a label for the implicit edges. One could supply it via enrichment (like clustering, annotation, etc.)
edge_data = g3._edges.sample(10)

x_edges, _ = g3.transform(edge_data, None, kind='edges')
x_edges

In [None]:
# once built, we can get the DGL graph itself
G = g3.DGL_graph
G

In [None]:
# the features, targets, and masks
G.ndata

In [None]:
# `build_gnn()` will turn the edges obtained from umap into bona fide feature matrices,
# and make features out of explicit edges with `build_gnn(X_edges=[...], ..)`
G.edata['feature'].shape

In [None]:
# see the edge features, which have shape (n_edges, n_nodes + weight)
# notice that had we used filter_weighted_edges to create a new graphistry instance and then .build_gnn() we would get
# a different n_edges. Useful to keep in mind when building models without an explicit edge_dataframe
plt.figure(figsize=(15,8))
plt.imshow(G.edata['feature'][:400, :600], aspect='auto', cmap='hot')

In [None]:
# see the way edges are related across the first 500 edges.
plt.figure(figsize=(15,8))
plt.imshow(np.cov(G.edata['feature'][:500]), aspect='auto', cmap='hot')

In [None]:
# to see how to train a GNN, see the cyber or influence tutorial