# Create a node embeddings feature group

So far we have used feature engineering, the feature store and model training to create node embeddings. We will now materialise them as a node embeddings feature group, which will be used to train the anomaly detection model.

![Feature Stores](./images/online_offline_fs.png)

---
**NOTE**: 

In real-life scenarios, financial transactions form dynamically evolving graphs. If a live transaction monitoring system is based on graph or node embeddings, it first has to update the graph and the node representations whenever new transactions arrive. Recomputing the entire graph for every newly arrived transaction leads to unacceptable delays and can even cause the monitoring system to fail. The problem becomes more severe when a large number of updates happen in a short time window.

Contact us at Logical Clocks and we will help you set up an end-to-end, graph-based deep anomaly detection system for live transaction monitoring.

---
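
One possible way to avoid a full recomputation, sketched here under stated assumptions, is to refresh only the nodes close to newly arrived transactions. The function name `affected_nodes`, the adjacency dictionary and the `hops` parameter are hypothetical and are not part of this tutorial's pipeline. The sketch ignores edge direction.

```python
from collections import deque


def affected_nodes(adjacency, new_edges, hops=1):
    """Return nodes within `hops` steps of any endpoint of a new edge.

    adjacency: dict mapping node id -> set of neighbour ids (updated in place).
    new_edges: iterable of (source, target) pairs for newly arrived transactions.
    """
    new_edges = list(new_edges)

    # Add the new edges first so the search sees the updated graph.
    for src, dst in new_edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    # Multi-source breadth-first search, bounded by `hops`.
    depth_of = {}
    frontier = deque()
    for src, dst in new_edges:
        for node in (src, dst):
            if node not in depth_of:
                depth_of[node] = 0
                frontier.append(node)
    while frontier:
        node = frontier.popleft()
        if depth_of[node] == hops:
            continue
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth_of:
                depth_of[neighbour] = depth_of[node] + 1
                frontier.append(neighbour)
    return set(depth_of)


# Toy graph: a - b - c - d, then a new transaction a -> e arrives.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
to_refresh = affected_nodes(graph, [("a", "e")], hops=1)
```

Only the embeddings of the nodes in `to_refresh` would then be recomputed. Whether such a local refresh is accurate enough depends on the embedding method and has to be validated separately.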

## Query the Model Registry for the best node embeddings model

In [1]:
import hsml

conn = hsml.connection()
mr = conn.get_model_registry()

MODEL_NAME="NodeEmbeddings"
EVALUATION_METRIC="accuracy"


Starting Spark application


ID,Application ID,Kind,State,Spark UI,Driver log
5,application_1651410992845_0006,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

In [2]:
best_model = mr.get_best_model(MODEL_NAME, EVALUATION_METRIC, "max")

In [3]:
print('Model name: ' + best_model.name)
print('Model version: ' + str(best_model.version))
print(best_model.training_metrics)

Model name: NodeEmbeddings
Model version: 1
{'accuracy': '0.708645224571228'}
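
Note that the metric value is returned as a string. The sketch below illustrates what selecting a best model by metric amounts to. It is not how `hsml` implements `get_best_model`; the helper `pick_best` and the extra model versions are hypothetical.

```python
def pick_best(models, metric, direction="max"):
    """Pick the entry whose `metric` is best; metric values may be strings."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    # Skip entries that never logged the metric.
    candidates = [m for m in models if metric in m["training_metrics"]]
    if not candidates:
        return None
    chooser = max if direction == "max" else min
    # Convert to float: comparing the raw strings would sort lexicographically.
    return chooser(candidates, key=lambda m: float(m["training_metrics"][metric]))


# Hypothetical registry contents; only version 1 mirrors the output shown above.
models = [
    {"name": "NodeEmbeddings", "version": 1,
     "training_metrics": {"accuracy": "0.708645224571228"}},
    {"name": "NodeEmbeddings", "version": 2,
     "training_metrics": {"accuracy": "0.65"}},
    {"name": "NodeEmbeddings", "version": 3,
     "training_metrics": {"loss": "0.9"}},
]
best = pick_best(models, "accuracy", "max")
```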

## Define model and load weights

In [4]:
import json

# tensorflow 
import tensorflow as tf
from tensorflow import keras  

# pandas and numpy
import numpy as np
import pandas as pd

# stellargraph library
from stellargraph import StellarDiGraph
from stellargraph.mapper import Node2VecLinkGenerator, Node2VecNodeGenerator
from stellargraph.data import UnsupervisedSampler, BiasedRandomWalk
from stellargraph.layer import Node2Vec

# pyspark functions
from pyspark.sql import functions as F
from pyspark.sql.functions import array, coalesce, concat,  col

# hops utility library for accessing files in HopsFS
from hops import hdfs

## Connect the hsfs library and get the feature store handle

In [5]:
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

### Get node and edge training dataset objects

In [6]:
node_td = fs.get_training_dataset("node_td", 1)
edge_td = fs.get_training_dataset("edges_td", 1)

### Read training datasets as pandas DataFrames

In [7]:
# Read the training datasets and convert them to pandas DataFrames
node_pdf = node_td.read().toPandas()
edge_pdf = edge_td.read().toPandas()

### Read hyperparameters for graph embeddings

In [8]:
best_hyperparams_path = "Resources/embeddings_best_hp.json"
best_hyperparams = json.loads(hdfs.load(best_hyperparams_path))
args_dict = {}
for key in best_hyperparams.keys():
    args_dict[key] = [best_hyperparams[key]]
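
Note that this loop wraps every hyperparameter in a one-element list, so a scalar has to be recovered with `[0]` wherever one is expected. A minimal standalone illustration follows; the JSON string and its values are hypothetical stand-ins for the contents of `Resources/embeddings_best_hp.json`.

```python
import json

# Hypothetical file contents; the real values live in the JSON file on HopsFS.
raw = '{"walk_number": 10, "walk_length": 5, "emb_size": 32}'
best_hyperparams = json.loads(raw)

# Same wrapping as in the cell above: every value becomes a one-element list.
args_dict = {key: [value] for key, value in best_hyperparams.items()}

# Unwrap with [0] before passing a value where a scalar is expected.
walk_number = args_dict["walk_number"][0]
emb_size = args_dict["emb_size"][0]
```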
    

### Construct stellargraph Graph object

In [9]:
node_data = pd.DataFrame(node_pdf[['type']], index=node_pdf['id'])
print('Defining StellarDiGraph')
G = StellarDiGraph(node_data,
                   edges=edge_pdf,
                   edge_type_column="tx_type")

Defining StellarDiGraph

### Infer node embeddings

In [10]:
# Hyperparameters are stored as one-element lists, so unwrap them with [0]
walk_number = args_dict['walk_number'][0]
walk_length = args_dict['walk_length'][0]
batch_size = 1
emb_size = args_dict['emb_size'][0]
# Extracting node embeddings
walker = BiasedRandomWalk(
    G,
    n=walk_number,
    length=walk_length,
    p=0.5,  # defines (unnormalised) probability, 1/p, of returning to source node
    q=2.0,  # defines (unnormalised) probability, 1/q, for moving to a node away from the source node
)
unsupervised_samples = UnsupervisedSampler(G, nodes=list(G.nodes()), walker=walker)
generator = Node2VecLinkGenerator(G, batch_size)

node2vec = Node2Vec(emb_size, generator=generator)
x_inp, x_out = node2vec.in_out_tensors()

x_inp_src = x_inp[0]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
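
To make the role of `p` and `q` in the biased random walk more concrete, here is a minimal sketch of the second-order transition rule used by node2vec, written in plain Python rather than with StellarGraph. The helper `transition_weights` and the toy graph are hypothetical. The sketch ignores edge direction and edge weights, which the actual walker may take into account.

```python
def transition_weights(adjacency, previous, current, p=0.5, q=2.0):
    """Normalised second-order walk probabilities for leaving `current`.

    adjacency: dict mapping node id -> set of neighbour ids.
    previous:  node visited just before `current`.
    """
    weights = {}
    for candidate in adjacency[current]:
        if candidate == previous:
            weights[candidate] = 1.0 / p  # step back to where we came from
        elif candidate in adjacency[previous]:
            weights[candidate] = 1.0      # stays at distance 1 from `previous`
        else:
            weights[candidate] = 1.0 / q  # moves further away from `previous`
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy undirected graph with edges t-v, t-x1, v-x1, v-x2.
adjacency = {
    "t": {"v", "x1"},
    "v": {"t", "x1", "x2"},
    "x1": {"t", "v"},
    "x2": {"v"},
}
probs = transition_weights(adjacency, previous="t", current="v")
```

With `p=0.5` and `q=2.0`, returning to the previous node is the most likely step and moving further away is the least likely one, so walks tend to stay local.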

In [11]:
nodes = list(G.nodes())
node_gen = Node2VecNodeGenerator(G, batch_size).flow(nodes)

In [12]:
pdf = pd.DataFrame(embedding_model.predict(node_gen), index=G.nodes())

In [13]:
emb_feature_names = ["em_" + str(c)  for c in pdf.columns]
pdf.columns = emb_feature_names
pdf['id'] = pdf.index
node_embeddings_df = spark.createDataFrame(pdf)

In [14]:
node_embeddings_df.show(2)

[Output truncated: a wide table with one column per embedding dimension (em_0, em_1, ...) and the id column.]

In [15]:
node_embeddings_df = node_embeddings_df.withColumn("embedding", array(emb_feature_names)).select("id","embedding")
node_embeddings_df.show(2)

+--------+--------------------+
|      id|           embedding|
+--------+--------------------+
|9ad5bb7e|[-0.7732861042022...|
|9ad7aece|[-0.2603678703308...|
+--------+--------------------+
only showing top 2 rows

## Create embeddings feature group

In [16]:
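from hsfs import engine
# Derive the feature schema from the Spark DataFrame
features = engine.get_instance().parse_schema_feature_group(node_embeddings_df)
# Store embedding arrays as binary values in the online feature store
for f in features:
    if f.type == "array<double>":
        f.online_type = "VARBINARY(200)"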
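
The `VARBINARY(200)` limit has to be large enough to hold a serialised embedding. As a rough sanity check, and assuming only that each double needs at least 8 bytes (the actual serialisation format used by the online store may add overhead and is not covered here), the minimum size can be estimated as follows. The helper names and the embedding sizes 16 and 32 are illustrative, not values taken from this tutorial.

```python
import struct


def min_bytes_for_doubles(n_values):
    """Bytes needed for n_values 64-bit floats, excluding any format overhead."""
    return struct.calcsize("<%dd" % n_values)


def fits(n_values, column_bytes):
    """True if the raw doubles alone fit into a VARBINARY(column_bytes) column."""
    return min_bytes_for_doubles(n_values) <= column_bytes


# 16 and 32 are illustrative embedding sizes.
sizes = {n: min_bytes_for_doubles(n) for n in (16, 32)}
```

If the estimate for the chosen `emb_size` exceeds the declared column size, it may be worth increasing the `VARBINARY` length before saving the feature group.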

In [17]:
node_embeddings_fg = fs.create_feature_group(
    name="node_embeddings_fg",
    version=1,
    primary_key=["id"],
    description="node embeddings from transactions",
    time_travel_format="HUDI",
    online_enabled=True,
    statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False},
    features=features)

node_embeddings_fg.save(node_embeddings_df)

## Feature group provenance
![Feature group provenance](./images/provenance_fg.png)