# Create node embeddings feature groups.

So far we have used feature engineering, the feature store, and model training to create node embeddings. We will now materialise them as a node embeddings feature group, which will be used to train the anomaly detection model.

![Feature Stores](./images/online_offline_fs.png)

---
**NOTE**: 

In real-life scenarios, financial transactions form dynamically evolving graphs. If a live transaction monitoring system is based on graph or node embeddings, it must first update the graph and the node representations whenever new transactions arrive. Recomputing the entire graph for every newly arrived transaction leads to unacceptable delays and can even cause the monitoring system to fail. The problem becomes more severe when a large number of updates arrive within a short time window.

Contact us at Logical Clocks and we will help you set up end-to-end, graph-based deep anomaly detection for live transaction monitoring systems.

---
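To make the note above concrete, here is a minimal sketch of one way to avoid a full recomputation: when new transactions arrive, only the nodes they touch (plus their neighbours within a few hops) are marked for an embedding refresh. The function names, the toy graph, and the choice of `hops=1` are assumptions for illustration; this is not how any particular monitoring system does it.

```python
from collections import defaultdict, deque


def add_transactions(neighbours, new_edges):
    """Record new edges in an undirected neighbour map and return the touched nodes."""
    touched = set()
    for source, target in new_edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
        touched.update((source, target))
    return touched


def nodes_to_refresh(neighbours, touched, hops=1):
    """Breadth-first search up to `hops` steps away from the touched nodes."""
    seen = set(touched)
    queue = deque((node, 0) for node in touched)
    while queue:
        node, depth = queue.popleft()
        if depth == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in neighbours.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


# Hypothetical toy graph: a - b - c - d
graph = defaultdict(set)
add_transactions(graph, [("a", "b"), ("b", "c"), ("c", "d")])

# A new transaction between e and a arrives.
touched = add_transactions(graph, [("e", "a")])
stale = nodes_to_refresh(graph, touched, hops=1)
print(sorted(stale))  # ['a', 'b', 'e']
```

Only `a`, `b` and `e` are marked as stale here; `c` and `d` keep their existing embeddings. How far the refresh has to reach in practice depends on the embedding method and is outside the scope of this sketch.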

## Query the Model Registry for the best node embeddings model

In [1]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()

MODEL_NAME="NodeEmbeddings"
EVALUATION_METRIC="accuracy"


Connected. Call `.close()` to terminate connection gracefully.


In [2]:
best_model = mr.get_best_model(MODEL_NAME, EVALUATION_METRIC, "max")

In [3]:
print('Model name: ' + best_model.name)
print('Model version: ' + str(best_model.version))
print(best_model.training_metrics)

Model name: NodeEmbeddings
Model version: 1
{'accuracy': '0.7269180417060852'}
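As an illustration of what selecting a model by metric and direction amounts to, the sketch below picks the best entry from a plain list of dictionaries. It is not the `hsml` implementation; `pick_best` and the second registry entry are hypothetical. The metric is stored as a string in the output above, so the sketch converts it to `float` before comparing.

```python
def pick_best(models, metric, direction="max"):
    """Return the model whose training metric is highest ('max') or lowest ('min')."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    # Ignore models that never reported the requested metric.
    candidates = [m for m in models if metric in m["training_metrics"]]
    if not candidates:
        return None
    chooser = max if direction == "max" else min
    # Metric values may be stored as strings, so compare them as floats.
    return chooser(candidates, key=lambda m: float(m["training_metrics"][metric]))


registered = [
    # Version 1 mirrors the output shown above.
    {"name": "NodeEmbeddings", "version": 1,
     "training_metrics": {"accuracy": "0.7269180417060852"}},
    # Version 2 is a made-up entry for the sake of the example.
    {"name": "NodeEmbeddings", "version": 2,
     "training_metrics": {"accuracy": "0.65"}},
]

best = pick_best(registered, "accuracy", "max")
print(best["version"])  # 1
```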


## Define model and load weights

In [5]:
import json

# tensorflow 
import tensorflow as tf
from tensorflow import keras  

# pandas and numpy
import numpy as np
import pandas as pd

# stellargraph library
from stellargraph import StellarDiGraph
from stellargraph.mapper import Node2VecLinkGenerator, Node2VecNodeGenerator
from stellargraph.data import UnsupervisedSampler, BiasedRandomWalk
from stellargraph.layer import Node2Vec

# hops utility library for accessing files in HopsFS
from hops import hdfs

## Connect to hsfs and get the feature store handle

In [6]:
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Get node and edge training dataset objects

In [8]:
# Get nodes and edges feature view from hsfs
node_fv = fs.get_feature_view(
        name = 'nodes_feature_view',
        version = 1
    )
    
edge_fv = fs.get_feature_view(
        name = 'edges_feature_view',
        version = 1
    )

# Get nodes and edges training datasets from hsfs 
_, node_pdf = node_fv.get_training_dataset(version = 1)
    
_, edge_pdf_2020 = edge_fv.get_training_dataset(version = 1)
_, edge_pdf_2021 = edge_fv.get_training_dataset(version = 2)
# DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
edge_pdf = pd.concat([edge_pdf_2020, edge_pdf_2021], ignore_index=True)



### Read hyperparameters for graph embeddings

In [9]:
best_hyperparams_path = "Resources/embeddings_best_hp.json"
best_hyperparams = json.loads(hdfs.load(best_hyperparams_path))
args_dict = {}
for key in best_hyperparams.keys():
    args_dict[key] = [best_hyperparams[key]]
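The loop above wraps every hyperparameter in a one-element list, so a scalar has to be read back with `[0]`. The sketch below shows the effect on a small JSON string; the keys match those used later in this notebook, but the values are hypothetical stand-ins for the contents of `embeddings_best_hp.json`.

```python
import json

# Hypothetical file contents; the real values may differ.
raw = '{"walk_number": 10, "walk_length": 5, "emb_size": 32}'
best_hyperparams = json.loads(raw)

# Same transformation as the loop above, written as a dict comprehension.
args_dict = {key: [value] for key, value in best_hyperparams.items()}

print(args_dict["emb_size"])     # [32]  (a list, not a number)

# Unwrap with [0] wherever a scalar is expected.
walk_number = args_dict["walk_number"][0]
walk_length = args_dict["walk_length"][0]
emb_size = args_dict["emb_size"][0]
print(walk_number, walk_length, emb_size)  # 10 5 32
```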

### Construct stellargraph Graph object

In [11]:
node_data = pd.DataFrame(node_pdf[['type']], index=node_pdf['id'])
print('Defining StellarDiGraph')
G = StellarDiGraph(node_data, edges=edge_pdf)

Defining StellarDiGraph


### Infer node embeddings

In [12]:
# args_dict wraps every value in a one-element list, so unwrap each with [0]
walk_number = args_dict['walk_number'][0]
walk_length = args_dict['walk_length'][0]
batch_size = 1
emb_size = args_dict['emb_size'][0]
# Extracting node embeddings
walker = BiasedRandomWalk(
        G,
        n=walk_number,
        length=walk_length,
        p=0.5,  # defines probability, 1/p, of returning to source node
        q=2.0,  # defines probability, 1/q, for moving to a node away from the source node
    )
unsupervised_samples = UnsupervisedSampler(G, nodes=list(G.nodes()), walker=walker)
generator = Node2VecLinkGenerator(G, batch_size)

node2vec = Node2Vec(emb_size, generator=generator)
x_inp, x_out = node2vec.in_out_tensors()

x_inp_src = x_inp[0]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
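The comments on `p` and `q` in the cell above can be made concrete with a small sketch of the second-order transition weights used by node2vec-style biased walks. This is a plain-Python illustration on a hypothetical undirected toy graph, not the StellarGraph implementation; `transition_weights` and `normalise` are names introduced here.

```python
def transition_weights(neighbours, previous, current, p, q):
    """Unnormalised weights for the next step from `current`, having arrived from `previous`."""
    weights = {}
    for candidate in neighbours[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # step back to the node we came from
        elif candidate in neighbours[previous]:
            weights[candidate] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[candidate] = 1.0 / q  # move further away from the previous node
    return weights


def normalise(weights):
    """Turn unnormalised weights into probabilities that sum to 1."""
    total = sum(weights.values())
    return {node: weight / total for node, weight in weights.items()}


# Hypothetical toy graph (undirected, unweighted).
neighbours = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

# The walk arrived at b from a; use the same p and q as in the cell above.
weights = transition_weights(neighbours, previous="a", current="b", p=0.5, q=2.0)
probabilities = normalise(weights)
print(sorted(weights.items()))  # [('a', 2.0), ('c', 1.0), ('d', 0.5)]
```

With `p=0.5` and `q=2.0`, returning to `a` is four times as likely as moving outward to `d`, so walks of this kind tend to stay close to where they started.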

In [13]:
nodes = list(G.nodes())
node_gen = Node2VecNodeGenerator(G, batch_size).flow(nodes)

In [16]:
node_embeddings_df = pd.DataFrame(embedding_model.predict(node_gen), index=G.nodes())

In [17]:
emb_feature_names = ["em_" + str(c)  for c in node_embeddings_df.columns]
node_embeddings_df.columns = emb_feature_names
node_embeddings_df['id'] = node_embeddings_df.index

In [18]:
node_embeddings_df.head()

| id (index) | em_0 | em_1 | em_2 | em_3 | em_4 | em_5 | em_6 | em_7 | em_8 | em_9 | ... | em_23 | em_24 | em_25 | em_26 | em_27 | em_28 | em_29 | em_30 | em_31 | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 90edf3b3 | 0.884438 | 0.171425 | -0.587368 | -0.046652 | 0.700641 | -0.078287 | 0.904313 | 0.330084 | -0.222736 | 0.720094 | ... | 0.878541 | 0.637493 | 0.101538 | -0.864437 | -0.506918 | -0.733507 | -0.261949 | 0.809876 | 0.751352 | 90edf3b3 |
| f3e394b3 | 0.631339 | -0.587456 | 0.089011 | 0.11126 | -0.083255 | 0.845549 | -0.043291 | 0.125343 | -0.366857 | -0.827737 | ... | -0.238882 | 0.319524 | 0.934617 | 0.226567 | 0.637395 | -0.352562 | -0.707387 | -0.937521 | 0.598549 | f3e394b3 |
| f4f7c3ff | 0.541005 | -0.071699 | 0.264466 | -0.991774 | -0.2707 | -0.303992 | 0.496623 | 0.369185 | -0.069429 | 0.744717 | ... | 0.790734 | 0.574312 | -0.619517 | -0.396369 | -0.324199 | 0.795869 | 0.293409 | 0.802166 | 0.169027 | f4f7c3ff |
| ba504e94 | 0.634268 | -0.108996 | 0.483871 | 0.77344 | 0.524577 | 0.949731 | 0.485697 | -0.465637 | 0.297805 | 0.705992 | ... | 0.821905 | -0.726782 | -0.478064 | -0.923091 | -0.817952 | -0.329499 | 0.56505 | 0.595801 | -0.354519 | ba504e94 |
| ff3b05c4 | 0.998889 | 0.828497 | 0.833624 | 0.729858 | -0.350022 | 0.345915 | -0.00361 | -0.440018 | 0.780548 | 0.194602 | ... | -0.471269 | 0.993464 | 0.699982 | 0.240239 | 0.165529 | 0.59796 | 0.32568 | -0.309894 | 0.072704 | ff3b05c4 |


In [26]:
node_embeddings_df["embedding"] = node_embeddings_df[emb_feature_names].to_numpy().tolist()
node_embeddings_df.drop(emb_feature_names, axis=1, inplace=True)

In [27]:
node_embeddings_df.head()

| id (index) | id | embedding |
|---|---|---|
| 90edf3b3 | 90edf3b3 | [0.8844377994537354, 0.17142486572265625, -0.5... |
| f3e394b3 | f3e394b3 | [0.6313390731811523, -0.5874557495117188, 0.08... |
| f4f7c3ff | f4f7c3ff | [0.5410051345825195, -0.07169866561889648, 0.2... |
| ba504e94 | ba504e94 | [0.6342678070068359, -0.10899639129638672, 0.4... |
| ff3b05c4 | ff3b05c4 | [0.9988889694213867, 0.8284969329833984, 0.833... |


## Create embeddings feature group

In [28]:
from hsfs import engine
features = engine.get_instance().parse_schema_feature_group(node_embeddings_df)
for f in features:
    if f.type == "array<double>":
        f.online_type = "VARBINARY(200)"
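Because the online type above is a fixed-size binary column, it can be worth estimating how many bytes one embedding needs. The sketch below is only a rough check under a stated assumption: that each value is stored as a raw 8-byte double with no extra framing. The serialisation actually used by the online feature store is not shown in this notebook and may differ, so treat the number as a lower-bound estimate rather than the real stored size.

```python
import struct

emb_size = 32                      # em_0 .. em_31 in the table above
embedding = [0.0] * emb_size       # placeholder values; only the count matters here

# Pack the values as little-endian 8-byte doubles (an assumed encoding).
raw_bytes = struct.pack(f"<{emb_size}d", *embedding)

column_size = 200                  # the size declared in VARBINARY(200)
print(len(raw_bytes))              # 256
print(len(raw_bytes) <= column_size)  # False
```

Under this assumption a 32-dimensional embedding takes 256 bytes, which is more than 200, so it may be worth checking the declared column size against the encoding the feature store really uses.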

In [29]:
node_embeddings_fg = fs.create_feature_group(
    name="node_embeddings_fg",
    version=1,
    primary_key=["id"],
    description="node embeddings from transactions",
    time_travel_format="HUDI",
    online_enabled=True,
    statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False},
    features=features,
)

node_embeddings_fg.save(node_embeddings_df)

Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/119/jobs/named/node_embeddings_fg_1_offline_fg_backfill/executions


<hsfs.core.job.Job at 0x7fc186cfb5b0>

## Feature group provenance
![Feature group provenance](./images/provenance_fg.png)