# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 01: Load, Engineer & Connect</span>

<span style="font-width:bold; font-size: 1.4rem;"> This is the first part of the quick start series of tutorials about Hopsworks Feature Store. As part of this first module, we will work with data related to credit card transactions. 
The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store** for batch data, with the goal of training and deploying a model that can predict fraudulent transactions.</span>

## **🗒️ This notebook is divided into 3 sections:** 
1. Load the data and do feature engineering,
2. Connect to the Hopsworks feature store,
3. Create feature groups and upload them to the feature store.

![tutorial-flow](images/01_featuregroups.png)

First of all we will load the data and do some feature engineering on it.

### 📝 Import libraries 

In [1]:
# Import necessary libraries for feature engineering
# common libraries for hashing and date time conversions
import hashlib
import datetime

# pandas for feature engineering 
import pandas as pd
import numpy as np

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

The data we will use comes from three different CSV files:

- `transactions.csv`: transaction information such as timestamp, location, and the amount. 
- `alert_transactions.csv`: Suspicious Activity Report (SAR) transactions.
- `party.csv`: User profile information.

We can conceptualize these CSV files as originating from separate data sources.
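**The transactions and alert files share a transaction ID column `tran_id`, and the party IDs in `party.csv` (`partyId`) correspond to the `src` and `dst` columns of the transactions. We can use these columns for joins.**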

Let's go ahead and load the data.

#### ⛳️ Transactions dataset

In [2]:
# Hops hdfs utility library for reading and writing files from HopsFs
from hops import hdfs
project_name = hdfs.project_name()
project_path = hdfs.project_path()

transactions_df = pd.read_csv(
    f"{project_path}/Jupyter/AMLend2end/demodata/transactions.csv",
    parse_dates = ['tran_timestamp']
)

transactions_df


Unnamed: 0,tran_id,tx_type,base_amt,tran_timestamp,src,dst
0,496,TRANSFER-FanOut,858.77,2020-01-01 00:00:00+00:00,3aa9646b,1e46e726
1,1342,TRANSFER-Mutual,386.86,2020-01-01 00:00:00+00:00,49203bc3,a74d1101
2,1580,TRANSFER-FanOut,616.43,2020-01-02 00:00:00+00:00,616d4505,99af2455
3,2866,TRANSFER-FanOut,146.44,2020-01-02 00:00:00+00:00,39be1ea2,e7ec7bdb
4,3997,TRANSFER-Mutual,439.09,2020-01-03 00:00:00+00:00,e2e0d938,afc399a9
...,...,...,...,...,...,...
438381,1026833,TRANSFER-Forward,896.02,2021-12-18 00:00:00+00:00,a4280e2f,969c0f15
438382,1028252,TRANSFER-Periodical,639.81,2021-12-19 00:00:00+00:00,19d05ee4,c3699eca
438383,1028381,TRANSFER-Mutual,388.36,2021-12-20 00:00:00+00:00,53cfa390,5aae5f9f
438384,1028955,TRANSFER-Periodical,352.16,2021-12-20 00:00:00+00:00,0c7d38a4,b0b014d9


#### ⛳️ Alert Transactions dataset

In [3]:
alert_transactions = pd.read_csv(f"{project_path}/Jupyter/AMLend2end/demodata/alert_transactions.csv")
alert_transactions.head()

Unnamed: 0,alert_id,alert_type,is_sar,tran_id
0,47,gather_scatter,True,11873
1,47,gather_scatter,True,11874
2,47,gather_scatter,True,11875
3,47,gather_scatter,True,13151
4,47,gather_scatter,True,23148


#### ⛳️ Party dataset

In [4]:
party = pd.read_csv(f"{project_path}/Jupyter/AMLend2end/demodata/party.csv")
party.head()

Unnamed: 0,partyId,partyType
0,5628bd6c,Organization
1,a1fcba39,Organization
2,f56c9501,Individual
3,9969afdd,Organization
4,b356eeae,Individual


## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

#### To investigate patterns of suspicious activity, we will compute monthly time window aggregates: the frequency, total, mean and standard deviation of the amounts of incoming and outgoing transactions.


In [5]:
transactions_df.columns = ['tran_id', 'tx_type', 'base_amt', 'tran_timestamp', 'source', 'target']
transactions_df = transactions_df[["source","target","tran_timestamp","tran_id", "base_amt"]]
transactions_df.head()

Unnamed: 0,source,target,tran_timestamp,tran_id,base_amt
0,3aa9646b,1e46e726,2020-01-01 00:00:00+00:00,496,858.77
1,49203bc3,a74d1101,2020-01-01 00:00:00+00:00,1342,386.86
2,616d4505,99af2455,2020-01-02 00:00:00+00:00,1580,616.43
3,39be1ea2,e7ec7bdb,2020-01-02 00:00:00+00:00,2866,146.44
4,e2e0d938,afc399a9,2020-01-03 00:00:00+00:00,3997,439.09


##### Outgoing transactions

In [6]:
out_df = transactions_df.groupby([pd.Grouper(key='tran_timestamp', freq='M'), 'source'])\
                            .agg(monthly_count=('source','count'), 
                                 monthly_total_amount=('base_amt','sum'),
                                 monthly_mean_amount=('base_amt','mean'),
                                 monthly_std_amount=('base_amt','std')
                                )
out_df = out_df.reset_index(level=["source"])
out_df = out_df.reset_index(level=["tran_timestamp"])
out_df.columns  = ["tran_timestamp", "id", "monthly_out_count", "monthly_out_total_amount", "monthly_out_mean_amount", "monthly_out_std_amount"]
out_df.tran_timestamp = out_df.tran_timestamp.values.astype(np.int64) // 10 ** 6
out_df

Unnamed: 0,tran_timestamp,id,monthly_out_count,monthly_out_total_amount,monthly_out_mean_amount,monthly_out_std_amount
0,1580428800000,0016359b,4,1843.32,460.830000,252.951744
1,1580428800000,0019b8d0,6,3074.78,512.463333,308.247279
2,1580428800000,00298665,1,521.11,521.110000,
3,1580428800000,003e2533,3,1440.05,480.016667,251.265814
4,1580428800000,00498ec2,5,3414.95,682.990000,58.726141
...,...,...,...,...,...,...
135063,1640908800000,ffb8c4c8,2,1259.38,629.690000,516.371798
135064,1640908800000,ffc0c534,1,258.73,258.730000,
135065,1640908800000,ffc145fd,3,1149.78,383.260000,207.140896
135066,1640908800000,ffc288ac,2,1891.42,945.710000,24.480037
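
The monthly aggregation above can be sketched without pandas to make the four statistics explicit. This is only an illustration on hypothetical toy data (the `transactions` list and the `monthly_out` dictionary are not part of the notebook). It assumes the same convention pandas uses: a sample standard deviation, which is undefined (NaN) when a party has a single transaction in a month.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

# Hypothetical toy transactions: (source, date, amount).
transactions = [
    ("a", date(2020, 1, 3), 100.0),
    ("a", date(2020, 1, 20), 300.0),
    ("b", date(2020, 1, 5), 50.0),
    ("a", date(2020, 2, 1), 10.0),
]

# Group amounts by (year, month, source), mirroring the monthly grouper plus "source".
groups = defaultdict(list)
for source, day, amount in transactions:
    groups[(day.year, day.month, source)].append(amount)

monthly_out = {}
for key, amounts in sorted(groups.items()):
    monthly_out[key] = {
        "monthly_out_count": len(amounts),
        "monthly_out_total_amount": sum(amounts),
        "monthly_out_mean_amount": mean(amounts),
        # Sample standard deviation needs at least two values; use NaN otherwise.
        "monthly_out_std_amount": stdev(amounts) if len(amounts) > 1 else float("nan"),
    }

for key, stats in monthly_out.items():
    print(key, stats)
```

This explains the empty `monthly_out_std_amount` values in the table above: they belong to rows where `monthly_out_count` is 1.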


##### Incoming transactions

In [7]:
in_df = transactions_df.groupby([pd.Grouper(key='tran_timestamp', freq='M'), 'target'])\
                            .agg(monthly_count=('target','count'), 
                                 monthly_total_amount=('base_amt','sum'),
                                 monthly_mean_amount=('base_amt','mean'),
                                 monthly_std_amount=('base_amt','std'))

in_df = in_df.reset_index(level=["target"])
in_df = in_df.reset_index(level=["tran_timestamp"])
in_df.columns  = ["tran_timestamp", "id", "monthly_in_count", "monthly_in_total_amount", "monthly_in_mean_amount", "monthly_in_std_amount"]
in_df.tran_timestamp = in_df.tran_timestamp.values.astype(np.int64) // 10 ** 6
in_df

Unnamed: 0,tran_timestamp,id,monthly_in_count,monthly_in_total_amount,monthly_in_mean_amount,monthly_in_std_amount
0,1580428800000,0016359b,4,1872.92,468.230000,175.274700
1,1580428800000,001dcc27,9,5874.64,652.737778,271.236889
2,1580428800000,00298665,1,755.64,755.640000,
3,1580428800000,003cd8f3,4,2678.30,669.575000,179.515750
4,1580428800000,003e2533,4,3328.82,832.205000,172.041666
...,...,...,...,...,...,...
135763,1640908800000,ffa51ef4,3,1645.87,548.623333,175.806772
135764,1640908800000,ffb07a36,5,3633.74,726.748000,255.681639
135765,1640908800000,ffb8666b,1,494.51,494.510000,
135766,1640908800000,ffb8c4c8,3,1133.14,377.713333,155.277241


##### Now let's join the incoming and outgoing transactions datasets

In [8]:
in_out_df = in_df.merge(out_df, on=['tran_timestamp', 'id'], how="outer")
in_out_df =  in_out_df.fillna(0)
in_out_df

Unnamed: 0,tran_timestamp,id,monthly_in_count,monthly_in_total_amount,monthly_in_mean_amount,monthly_in_std_amount,monthly_out_count,monthly_out_total_amount,monthly_out_mean_amount,monthly_out_std_amount
0,1580428800000,0016359b,4.0,1872.92,468.230000,175.274700,4.0,1843.32,460.830000,252.951744
1,1580428800000,001dcc27,9.0,5874.64,652.737778,271.236889,0.0,0.00,0.000000,0.000000
2,1580428800000,00298665,1.0,755.64,755.640000,0.000000,1.0,521.11,521.110000,0.000000
3,1580428800000,003cd8f3,4.0,2678.30,669.575000,179.515750,0.0,0.00,0.000000,0.000000
4,1580428800000,003e2533,4.0,3328.82,832.205000,172.041666,3.0,1440.05,480.016667,251.265814
...,...,...,...,...,...,...,...,...,...,...
170871,1640908800000,ff3014ea,0.0,0.00,0.000000,0.000000,6.0,3028.86,504.810000,267.244355
170872,1640908800000,ff34a03c,0.0,0.00,0.000000,0.000000,1.0,665.06,665.060000,0.000000
170873,1640908800000,ffc145fd,0.0,0.00,0.000000,0.000000,3.0,1149.78,383.260000,207.140896
170874,1640908800000,ffc288ac,0.0,0.00,0.000000,0.000000,2.0,1891.42,945.710000,24.480037
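
The outer join followed by `fillna(0)` keeps every `(tran_timestamp, id)` pair that appears on either side and fills the missing side with zeros. A minimal sketch of that behaviour in plain Python, using hypothetical toy dictionaries (`in_features`, `out_features`) rather than the notebook's dataframes:

```python
# Hypothetical toy features keyed by (month, id).
in_features = {
    ("2020-01", "a"): {"monthly_in_count": 4},
    ("2020-01", "b"): {"monthly_in_count": 9},
}
out_features = {
    ("2020-01", "a"): {"monthly_out_count": 4},
    ("2020-01", "c"): {"monthly_out_count": 6},
}

# Defaults play the role of fillna(0) for the side that has no matching row.
in_defaults = {"monthly_in_count": 0}
out_defaults = {"monthly_out_count": 0}

joined = {}
# The union of keys corresponds to how="outer".
for key in sorted(set(in_features) | set(out_features)):
    joined[key] = {
        **in_defaults,
        **in_features.get(key, {}),
        **out_defaults,
        **out_features.get(key, {}),
    }

for key, row in joined.items():
    print(key, row)
```

Note that in the notebook the fill also replaces the undefined standard deviations of single-transaction months with 0, which is why no empty cells remain in the joined table.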


#### Assign labels to transactions that were identified as suspicious activity

In [9]:
alert_transactions

Unnamed: 0,alert_id,alert_type,is_sar,tran_id
0,47,gather_scatter,True,11873
1,47,gather_scatter,True,11874
2,47,gather_scatter,True,11875
3,47,gather_scatter,True,13151
4,47,gather_scatter,True,23148
...,...,...,...,...
910,78,cycle,True,1016723
911,78,cycle,True,1016724
912,78,cycle,True,1017521
913,18,scatter_gather,True,1021971


In [10]:
transaction_labels = transactions_df[["source","target","tran_id","tran_timestamp"]].merge(alert_transactions[["is_sar", "tran_id"]], on=["tran_id"], how="left")
transaction_labels.is_sar = transaction_labels.is_sar.map({True: 1, np.nan: 0})
transaction_labels.sort_values('tran_id',inplace = True)
transaction_labels.head()

Unnamed: 0,source,target,tran_id,tran_timestamp,is_sar
322886,cee9cf6d,79c248ae,2,2020-01-01 00:00:00+00:00,0
307052,65ab2f44,b20ce84b,3,2020-01-01 00:00:00+00:00,0
181198,2a39b731,a07edae4,4,2020-01-01 00:00:00+00:00,0
36864,528b9346,dc34c867,6,2020-01-01 00:00:00+00:00,0
351553,cc668310,b1d20498,7,2020-01-01 00:00:00+00:00,0


#### Now let's prepare the profile (party) dataset and assign labels indicating whether each party has been reported for suspicious activity or not

In [11]:
party.columns = ["id","type"]
party.type = party.type.map({"Individual": 0, "Organization": 1})

party.head()

Unnamed: 0,id,type
0,5628bd6c,1
1,a1fcba39,1
2,f56c9501,0
3,9969afdd,1
4,b356eeae,0


In [12]:
alert_transactions = transaction_labels[transaction_labels.is_sar ==1]
alert_transactions.head()

Unnamed: 0,source,target,tran_id,tran_timestamp,is_sar
41322,5e7442f1,0bffd1da,11873,2020-01-09 00:00:00+00:00,1
62128,65c7b5a1,0bffd1da,11874,2020-01-09 00:00:00+00:00,1
57575,04128f28,0bffd1da,11875,2020-01-09 00:00:00+00:00,1
85346,462568a4,0bffd1da,13151,2020-01-10 00:00:00+00:00,1
59894,0bffd1da,4a1c2abc,23148,2020-01-17 00:00:00+00:00,1


In [13]:
alert_transactions = transaction_labels[transaction_labels.is_sar ==1]
alert_sources = alert_transactions[["source", "tran_timestamp"]]
alert_sources.columns = ["id", "tran_timestamp"]
alert_sources.head()
alert_targets = alert_transactions[["target", "tran_timestamp"]]
alert_targets.columns = ["id", "tran_timestamp"]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
sar_party = pd.concat([alert_sources, alert_targets], ignore_index=True)
sar_party.sort_values(["id", "tran_timestamp"], ascending = [False, True])

# find the first occurrence of a SAR per id
sar_party = sar_party.iloc[[sar_party.id.eq(id).idxmax() for id in sar_party['id'].value_counts().index]]
sar_party = sar_party.groupby([pd.Grouper(key='tran_timestamp', freq='M'), 'id']).agg(monthly_count=('id','count'))
sar_party = sar_party.reset_index(level=["id"])
sar_party = sar_party.reset_index(level=["tran_timestamp"])
sar_party.drop(["monthly_count"], axis=1, inplace=True)

sar_party["is_sar"] = 1
sar_party

Unnamed: 0,tran_timestamp,id,is_sar
0,2020-01-31 00:00:00+00:00,04128f28,1
1,2020-01-31 00:00:00+00:00,0bffd1da,1
2,2020-01-31 00:00:00+00:00,101c8a90,1
3,2020-01-31 00:00:00+00:00,462568a4,1
4,2020-01-31 00:00:00+00:00,4a1c2abc,1
...,...,...,...
811,2021-12-31 00:00:00+00:00,bc0ed20b,1
812,2021-12-31 00:00:00+00:00,e5cd7b59,1
813,2021-12-31 00:00:00+00:00,eac8b9c1,1
814,2021-12-31 00:00:00+00:00,f30f26a2,1


In [14]:
party_labels = party.merge(sar_party, on=["id"], how="left")
party_labels.is_sar = party_labels.is_sar.map({1.0: 1, np.nan: 0})
max_time_stamp = datetime.datetime.utcfromtimestamp(int(max(transaction_labels.tran_timestamp.values))/1e9)
party_labels = party_labels.fillna(max_time_stamp)

In [15]:
party_labels[party_labels.is_sar == 1]

Unnamed: 0,id,type,tran_timestamp,is_sar
11,33a8ff5b,1,2021-08-31 00:00:00+00:00,1
16,8b9017b8,1,2021-07-31 00:00:00+00:00,1
17,fcf3bbf3,0,2020-02-29 00:00:00+00:00,1
32,43e028ef,1,2021-06-30 00:00:00+00:00,1
46,9c187eed,1,2021-07-31 00:00:00+00:00,1
...,...,...,...,...
7281,a3351e52,0,2021-11-30 00:00:00+00:00,1
7316,8d935dce,0,2021-01-31 00:00:00+00:00,1
7321,80e982d4,1,2021-09-30 00:00:00+00:00,1
7326,5f4ec727,1,2021-04-30 00:00:00+00:00,1


In [16]:
party_labels[party_labels.is_sar == 0]

Unnamed: 0,id,type,tran_timestamp,is_sar
0,5628bd6c,1,2021-12-20 00:00:00,0
1,a1fcba39,1,2021-12-20 00:00:00,0
2,f56c9501,0,2021-12-20 00:00:00,0
3,9969afdd,1,2021-12-20 00:00:00,0
4,b356eeae,0,2021-12-20 00:00:00,0
...,...,...,...,...
7342,9f247241,0,2021-12-20 00:00:00,0
7343,152326aa,1,2021-12-20 00:00:00,0
7344,beba2a0b,1,2021-12-20 00:00:00,0
7345,8d72f42b,0,2021-12-20 00:00:00,0


### Graph representation learning using a graph convolution layer

Financial transactions can be represented as a dynamic network graph. Graph representation techniques 
give us the opportunity to represent each transaction within a broader context. In this example we will perform node 
representation learning. 

The network architecture of the graph convolution layer for node representation learning was taken from 
[this Keras example](https://keras.io/examples/graph/gnn_citations/). It performs the following steps:

1. **Prepare**: The input node representations are processed using a FFN to produce a *message*. You can simplify
the processing by only applying linear transformation to the representations.
2. **Aggregate**: The messages of the neighbours of each node are aggregated with
respect to the `edge_weights` using a *permutation invariant* pooling operation, such as *sum*, *mean*, and *max*,
to prepare a single aggregated message for each node. See, for example, [tf.math.unsorted_segment_sum](https://www.tensorflow.org/api_docs/python/tf/math/unsorted_segment_sum)
APIs used to aggregate neighbour messages.
3. **Update**: The `node_repesentations` and `aggregated_messages`—both of shape `[num_nodes, representation_dim]`—
are combined and processed to produce the new state of the node representations (node embeddings).
If `combination_type` is `gru`, the `node_repesentations` and `aggregated_messages` are stacked to create a sequence,
then processed by a GRU layer. Otherwise, the `node_repesentations` and `aggregated_messages` are added
or concatenated, then processed using a FFN.
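
The three steps above can be sketched on a tiny graph without TensorFlow. This is a simplified illustration, not the layer defined below: the FFN in the prepare step is replaced by a plain scaling with the edge weight, aggregation is a mean, and the update is an element-wise addition. All names and values are hypothetical.

```python
# Hypothetical toy graph: 3 nodes with 2-dimensional representations.
node_representations = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
# Each edge is (node_index, neighbour_index): the node receives a message from the neighbour.
edges = [(0, 1), (0, 2), (1, 2)]
edge_weights = [1.0, 1.0, 1.0]


def prepare(neighbour_representation, weight):
    # Stand-in for the FFN: scale the neighbour representation by the edge weight.
    return [weight * value for value in neighbour_representation]


def aggregate_mean(edges, weights, representations):
    # Permutation-invariant pooling: average the messages arriving at each node.
    num_nodes, dim = len(representations), len(representations[0])
    sums = [[0.0] * dim for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for (node, neighbour), weight in zip(edges, weights):
        message = prepare(representations[neighbour], weight)
        sums[node] = [s + m for s, m in zip(sums[node], message)]
        counts[node] += 1
    # Nodes without incoming messages keep an all-zero aggregate.
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


def update_add(representations, aggregated):
    # "add" combination: sum each node representation with its aggregated message.
    return [[r + a for r, a in zip(rep, agg)] for rep, agg in zip(representations, aggregated)]


aggregated_messages = aggregate_mean(edges, edge_weights, node_representations)
node_embeddings = update_add(node_representations, aggregated_messages)
print(aggregated_messages)
print(node_embeddings)
```

In the real layer the prepare and update steps are learned networks, so the output depends on trained weights. The sketch only shows how information flows along the edges.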


In [17]:
# import libraries to compute graph embeddings 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import networkx as nx

def create_ffn(hidden_units, dropout_rate, name=None):
    fnn_layers = []

    for units in hidden_units:
        fnn_layers.append(layers.BatchNormalization())
        fnn_layers.append(layers.Dropout(dropout_rate))
        fnn_layers.append(layers.Dense(units, activation=tf.nn.gelu))

    return keras.Sequential(fnn_layers, name=name)

class GraphConvLayer(layers.Layer):
    def __init__(
        self,
        hidden_units,
        dropout_rate=0.2,
        aggregation_type="mean",
        combination_type="concat",
        normalize=False,
        *args,
        **kwargs,
    ):
        super(GraphConvLayer, self).__init__(*args, **kwargs)

        self.aggregation_type = aggregation_type
        self.combination_type = combination_type
        self.normalize = normalize

        self.ffn_prepare = create_ffn(hidden_units, dropout_rate)
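        if self.combination_type == "gru":
            # The GRU consumes the stacked [node, aggregated] sequence built in update();
            # return_sequences=True keeps the time axis that update() unstacks.
            self.update_fn = layers.GRU(
                units=hidden_units[-1],
                activation="tanh",
                recurrent_activation="sigmoid",
                dropout=dropout_rate,
                return_sequences=True,
                recurrent_dropout=dropout_rate,
            )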
        else:
            self.update_fn = create_ffn(hidden_units, dropout_rate)

    def prepare(self, node_repesentations, weights=None):
        # node_repesentations shape is [num_edges, embedding_dim].
        messages = self.ffn_prepare(node_repesentations)
        if weights is not None:
            messages = messages * tf.expand_dims(weights, -1)
        return messages

    def aggregate(self, node_indices, neighbour_messages):
        # node_indices shape is [num_edges].
        # neighbour_messages shape: [num_edges, representation_dim].
        num_nodes = tf.math.reduce_max(node_indices) + 1
        if self.aggregation_type == "sum":
            aggregated_message = tf.math.unsorted_segment_sum(
                neighbour_messages, node_indices, num_segments=num_nodes
            )
        elif self.aggregation_type == "mean":
            aggregated_message = tf.math.unsorted_segment_mean(
                neighbour_messages, node_indices, num_segments=num_nodes
            )
        elif self.aggregation_type == "max":
            aggregated_message = tf.math.unsorted_segment_max(
                neighbour_messages, node_indices, num_segments=num_nodes
            )
        else:
            raise ValueError(f"Invalid aggregation type: {self.aggregation_type}.")

        return aggregated_message

    def update(self, node_repesentations, aggregated_messages):
        # node_repesentations shape is [num_nodes, representation_dim].
        # aggregated_messages shape is [num_nodes, representation_dim].
        if self.combination_type == "gru":
            # Create a sequence of two elements for the GRU layer.
            h = tf.stack([node_repesentations, aggregated_messages], axis=1)
        elif self.combination_type == "concat":
            # Concatenate the node_repesentations and aggregated_messages.
            h = tf.concat([node_repesentations, aggregated_messages], axis=1)
        elif self.combination_type == "add":
            # Add node_repesentations and aggregated_messages.
            h = node_repesentations + aggregated_messages
        else:
            raise ValueError(f"Invalid combination type: {self.combination_type}.")

        # Apply the processing function.
        node_embeddings = self.update_fn(h)
        if self.combination_type == "gru":
            node_embeddings = tf.unstack(node_embeddings, axis=1)[-1]

        if self.normalize:
            node_embeddings = tf.nn.l2_normalize(node_embeddings, axis=-1)
        return node_embeddings

    def call(self, inputs):
        """Process the inputs to produce the node_embeddings.

        inputs: a tuple of three elements: node_repesentations, edges, edge_weights.
        Returns: node_embeddings of shape [num_nodes, representation_dim].
        """

        node_repesentations, edges, edge_weights = inputs
        # Get node_indices (source) and neighbour_indices (target) from edges.
        node_indices, neighbour_indices = edges[0], edges[1]
        # neighbour_repesentations shape is [num_edges, representation_dim].
        neighbour_repesentations = tf.gather(node_repesentations, neighbour_indices)

        # Prepare the messages of the neighbours.
        neighbour_messages = self.prepare(neighbour_repesentations, edge_weights)
        # Aggregate the neighbour messages.
        aggregated_messages = self.aggregate(node_indices, neighbour_messages)
        # Update the node embedding with the neighbour messages.
        return self.update(node_repesentations, aggregated_messages)


class GNNNodeClassifier(tf.keras.Model):
    def __init__(
        self,
        graph_info,
        hidden_units,
        aggregation_type="sum",
        combination_type="concat",
        dropout_rate=0.2,
        normalize=True,
        *args,
        **kwargs,
    ):
        super(GNNNodeClassifier, self).__init__(*args, **kwargs)

        # Unpack graph_info to three elements: node_features, edges, and edge_weight.
        node_features, edges, edge_weights = graph_info
        self.node_features = node_features
        self.edges = edges
        self.edge_weights = edge_weights
        # Set edge_weights to ones if not provided.
        if self.edge_weights is None:
            self.edge_weights = tf.ones(shape=edges.shape[1])
        # Scale edge_weights to sum to 1.
        self.edge_weights = self.edge_weights / tf.math.reduce_sum(self.edge_weights)

        # Create a process layer.
        self.preprocess = create_ffn(hidden_units, dropout_rate, name="preprocess")
        # Create the first GraphConv layer.
        self.conv1 = GraphConvLayer(
            hidden_units,
            dropout_rate,
            aggregation_type,
            combination_type,
            normalize,
            name="graph_conv1",
        )
        # Create the second GraphConv layer.
        self.conv2 = GraphConvLayer(
            hidden_units,
            dropout_rate,
            aggregation_type,
            combination_type,
            normalize,
            name="graph_conv2",
        )
        # Create a postprocess layer.
        self.postprocess = create_ffn(hidden_units, dropout_rate, name="postprocess")
        # Create a compute logits layer.
        self.compute_logits = layers.Dense(hidden_units[0],  activation=tf.nn.tanh, name="logits")
        

    def call(self, input_node_indices):
        # Preprocess the node_features to produce node representations.
        x = self.preprocess(self.node_features)
        # Apply the first graph conv layer.
        x1 = self.conv1((x, self.edges, self.edge_weights))
        # Skip connection.
        x = x1 + x
        # Apply the second graph conv layer.
        x2 = self.conv2((x, self.edges, self.edge_weights))
        # Skip connection.
        x = x2 + x
        # Postprocess node embedding.
        x = self.postprocess(x)
        # Fetch node embeddings for the input node_indices.
        node_embeddings = tf.gather(x, input_node_indices)
        # Compute logits
        return self.compute_logits(node_embeddings)


In [18]:
def construct_gruph(input_df):
    sampled_party = party_labels[party_labels.id.isin(input_df.source) | (party_labels.id.isin(input_df.target))]
    # Copy so that adding the int_id column below does not write into a view of party_labels.
    sampled_party = sampled_party[["id", "type", "is_sar"]].copy()

    # Assign a unique integer id to each node to be compatible with TensorFlow.
    unique_ids = set(sampled_party.id.values)
    id_dict = {idn: i for i, idn in enumerate(unique_ids)}

    sampled_party['int_id'] = sampled_party['id'].apply(lambda x : id_dict[x])
    input_df['source'] = input_df['source'].apply(lambda x : id_dict[x])
    input_df['target'] = input_df['target'].apply(lambda x : id_dict[x])

    # construct graph info
    feature_names = ["type"]
    x_train = sampled_party.int_id.to_numpy()

    # Create an edges array (sparse adjacency matrix) of shape [2, num_edges].
    edges = input_df[["source", "target"]].to_numpy().T

    # Create an edge weights array of ones.
    edge_weights = tf.ones(shape=edges.shape[1])
    # Create a node features array of shape [num_nodes, num_features].
    node_features = tf.cast(
        sampled_party.sort_values("id")[feature_names].to_numpy(), dtype=tf.dtypes.float32
    )
    # Create graph info tuple with node_features, edges, and edge_weights.
    graph_info = (node_features, edges, edge_weights)

    print("Edges shape:", edges.shape)
    print("Nodes shape:", node_features.shape)

    node_features, edges, edge_weights = graph_info

    # hyper parameter for graph embeddings model
    hidden_units = [32, 32]
    learning_rate = 0.01
    dropout_rate = 0.5
    num_epochs = 2
    batch_size = 256

    # Construct the model
    model = GNNNodeClassifier(
        graph_info=graph_info,
        hidden_units=hidden_units,
        dropout_rate=dropout_rate,
        name="gnn_model",
    )

    # Compile the model.
    model.compile(
            #optimizer=keras.optimizers.Adam(learning_rate),
            optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate),
            loss=keras.losses.MeanSquaredError(),    
            metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")],
        )
    
    # Fit the model.
    history = model.fit(
            x=x_train,
            y=x_train,
            epochs=num_epochs,
            batch_size=batch_size,
        )
    # predict and return
    return {"id": sampled_party.id.to_numpy(), "graph_embeddings": list(model.predict(x_train).reshape(node_features.shape[0], hidden_units[0]))}

#### Compute time evolving graph embeddings

In [19]:
transaction_graphs_by_month = transaction_labels.groupby(pd.Grouper(key='tran_timestamp', freq='M')).apply(lambda x: construct_gruph(x))                 

Edges shape: (2, 19126)
Nodes shape: (7161, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 17644)
Nodes shape: (7094, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18750)
Nodes shape: (7129, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18268)
Nodes shape: (7135, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18925)
Nodes shape: (7140, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18221)
Nodes shape: (7127, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18754)
Nodes shape: (7114, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18849)
Nodes shape: (7140, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18481)
Nodes shape: (7156, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18774)
Nodes shape: (7120, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18030)
Nodes shape: (7116, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 19102)
Nodes shape: (7159, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18857)
Nodes shape: (7114, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 16894)
Nodes shape: (7094, 1)
Epoch 1/2
Epoch 2/2
Edges shape: (2, 18969)
Nodes shape: (7138, 1)
Epoch 1/2
Epoch

In [20]:
timestamps = transaction_graphs_by_month.index.values
graph_embeddings = transaction_graphs_by_month.tolist()

In [21]:
# Collect one dataframe per month and concatenate them once
# (DataFrame.append was removed in pandas 2.0).
monthly_frames = []
for timestamp, graph_embedding in zip(timestamps, graph_embeddings):
    df_tmp = pd.DataFrame(graph_embedding)
    df_tmp["tran_timestamp"] = timestamp
    monthly_frames.append(df_tmp)
graph_embdeddings_df = pd.concat(monthly_frames)
graph_embdeddings_df

Unnamed: 0,id,graph_embeddings,tran_timestamp
0,5628bd6c,"[0.9997027, 0.99976456, 0.9993363, 0.9994033, ...",2020-01-31
1,a1fcba39,"[0.999837, 0.99992293, 0.99986047, 0.99988055,...",2020-01-31
2,f56c9501,"[0.9998372, 0.99992293, 0.9998606, 0.99988055,...",2020-01-31
3,9969afdd,"[0.99970245, 0.99976456, 0.99933624, 0.9994033...",2020-01-31
4,b356eeae,"[0.9997025, 0.9997645, 0.9993363, 0.99940336, ...",2020-01-31
...,...,...,...
6845,87a2d0ba,"[0.99995875, 0.999986, 0.99998206, 0.9998698, ...",2021-12-31
6846,9f247241,"[0.9999325, 0.99998426, 0.99997705, 0.9998464,...",2021-12-31
6847,152326aa,"[0.9999325, 0.99998426, 0.99997705, 0.9998464,...",2021-12-31
6848,beba2a0b,"[0.9999325, 0.99998426, 0.99997705, 0.9998464,...",2021-12-31


#### Convert date time to Unix epoch milliseconds 

In [22]:
transaction_labels.tran_timestamp = transaction_labels.tran_timestamp.values.astype(np.int64) // 10 ** 6
graph_embdeddings_df.tran_timestamp = graph_embdeddings_df.tran_timestamp.values.astype(np.int64) // 10 ** 6
party_labels.tran_timestamp = party_labels.tran_timestamp.map(lambda x: datetime.datetime.timestamp(x) * 1000)
party_labels.tran_timestamp = party_labels.tran_timestamp.values.astype(np.int64)
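
As a cross-check of the conversion above, here is a minimal standard-library sketch of turning a timestamp into Unix epoch milliseconds. The helper `to_epoch_millis` is hypothetical and not part of the notebook. It assumes that naive timestamps (those without a time zone) should be interpreted as UTC, which is an assumption worth verifying for your own data.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts):
    """Return whole milliseconds since 1970-01-01T00:00:00Z for a datetime."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating point rounding.
    return (ts - EPOCH) // timedelta(milliseconds=1)


month_end = datetime(2020, 1, 31, tzinfo=timezone.utc)
print(to_epoch_millis(month_end))
```

For 2020-01-31 00:00:00 UTC this yields `1580428800000`, the same value that appears in the `tran_timestamp` column of the monthly aggregates.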

---

# 👮🏼‍♀️ Data Validation 

Before we define [feature groups](https://docs.hopsworks.ai/latest/generated/feature_group/), let's define [validation rules](https://docs.hopsworks.ai/latest/generated/feature_validation/) for features. We expect some features to comply with certain *rules* or *expectations*. For example, a transacted amount must be a positive value. If a transacted amount arrives as a negative value, we can either reject the `write` into the feature group and throw an error, or allow the write and raise a warning. In the next section we will create feature store `expectations`, attach them to feature groups, and apply them to dataframes being appended to those feature groups.

#### Data validation with Great Expectations in Hopsworks
You can use the Great Expectations (GE) library for validation in the Hopsworks feature store. 
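
To make the idea of an expectation concrete, the sketch below checks that the maximum of a column lies within given bounds. It is plain Python and deliberately does not use the Great Expectations API. The function `column_max_between` and the shape of its result are hypothetical and only illustrate the kind of check such a rule performs.

```python
def column_max_between(rows, column, min_value, max_value):
    """Check that the largest non-missing value of `column` lies in [min_value, max_value]."""
    values = [row[column] for row in rows if row[column] is not None]
    observed = max(values)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }


# Hypothetical toy rows with one missing amount.
rows = [{"base_amt": 858.77}, {"base_amt": 386.86}, {"base_amt": None}]

print(column_max_between(rows, "base_amt", min_value=0, max_value=10_000_000))
print(column_max_between(rows, "base_amt", min_value=0, max_value=500))
```

Depending on the policy chosen, a failed check like the second one would either block the write or only produce a warning.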

##  <img src="images/icon102.png" width="18px"></img> HSFS library

The Hopsworks feature store library is called `hsfs` (**H**opswork**s** **F**eature **S**tore). 
The library is Apache V2 licensed and available [here](https://github.com/logicalclocks/feature-store-api). The library is currently available for Python and JVM languages such as Scala and Java.
In this notebook, we are going to cover the Python part.

You can find the complete documentation of the library here: 

The first step is to establish a connection with your Hopsworks feature store instance and retrieve the object that represents the feature store you'll be working with. 

> By default `connection.get_feature_store()` returns the feature store of the project we are working with. However, it also accepts a project name as a parameter to select a different feature store.

In [23]:
import hsfs
from hsfs.rule import Rule

# Create a connection
connection = hsfs.connection()

# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### 🔬 Expectations suite

In [24]:
# Define Expectation Suite - no use of HSFS
import great_expectations as ge
from pprint import pprint
import json

expectation_suite = ge.core.ExpectationSuite(expectation_suite_name="aml_project_validations")
pprint(expectation_suite.to_json_dict(), indent=2)

{ 'data_asset_type': None,
  'expectation_suite_name': 'aml_project_validations',
  'expectations': [],
  'ge_cloud_id': None,
  'meta': {'great_expectations_version': '0.14.3'}}


In [25]:
expectation_suite.add_expectation(
  ge.core.ExpectationConfiguration(
  expectation_type="expect_column_max_to_be_between",
  kwargs={"column_list": [list(set(in_out_df.columns) - {"tran_timestamp", "id"})], "min_value": 0, "max_value": 10000000}) 
)

{"kwargs": {"column_list": [["monthly_in_std_amount", "monthly_in_mean_amount", "monthly_out_mean_amount", "monthly_in_count", "monthly_in_total_amount", "monthly_out_count", "monthly_out_total_amount", "monthly_out_std_amount"]], "min_value": 0, "max_value": 10000000}, "expectation_type": "expect_column_max_to_be_between", "meta": {}}

In [26]:
pprint(expectation_suite)

{
  "expectations": [
    {
      "kwargs": {
        "column_list": [
          [
            "monthly_in_std_amount",
            "monthly_in_mean_amount",
            "monthly_out_mean_amount",
            "monthly_in_count",
            "monthly_in_total_amount",
            "monthly_out_count",
            "monthly_out_total_amount",
            "monthly_out_std_amount"
          ]
        ],
        "min_value": 0,
        "max_value": 10000000
      },
      "expectation_type": "expect_column_max_to_be_between",
      "meta": {}
    }
  ],
  "data_asset_type": null,
  "meta": {
    "great_expectations_version": "0.14.3"
  },
  "expectation_suite_name": "aml_project_validations",
  "ge_cloud_id": null
}


---

## <span style="color:#ff5f27;"> 🪄 Register Feature Groups </span>

### Feature Groups

A `Feature Group` is a logical grouping of features, and experience has shown that this grouping generally originates from the features being derived from the same data source. The `Feature Group` lets you save metadata along with features, which defines how the Feature Store interprets them, combines them and reproduces training datasets created from them.

Generally, the features in a feature group are engineered together in an ingestion job. However, it is possible to have additional jobs to append features to an existing feature group. Furthermore, `feature groups` provide a way of defining a namespace for features, such that you can define features with the same name multiple times, but uniquely identified by the group they are contained in.

> It is important to note that `feature groups` are not groupings of features for immediate training of Machine Learning models. Instead, to ensure reusability of features, it is possible to combine features from any number of groups into training datasets.
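The namespacing idea can be sketched in plain Python. This is only an illustration and not the Hopsworks API: the dictionaries, the feature name `count`, the entity id `a1`, and the helper `build_training_row` are hypothetical stand-ins.

```python
# Hypothetical in-memory stand-ins for two feature groups, keyed by primary key "id".
# Both groups define a feature called "count"; the group name disambiguates them.
feature_groups = {
    "transactions_monthly_fg": {"a1": {"count": 12, "total_amount": 5400.0}},
    "party_fg": {"a1": {"count": 3, "type": 1}},
}


def build_training_row(selection, entity_id):
    """Combine features from any number of groups into one row for an entity."""
    row = {"id": entity_id}
    for group, feature in selection:
        # Qualify each feature with its group so equal names do not collide.
        row[f"{group}.{feature}"] = feature_groups[group][entity_id][feature]
    return row


row = build_training_row(
    [("transactions_monthly_fg", "count"), ("party_fg", "count"), ("party_fg", "type")],
    "a1",
)
print(row)
```

Each selected feature is addressed by the pair (group, name), which is why the same feature name can live in several groups and still be combined into a single training row.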

### Point-in-Time (PIT) joins in Hopsworks Feature Store

Feature groups will later be used to create a training dataset. To build one, data scientists usually have to put themselves back in time: seen from a past point in time, the prediction target is information about what was then the future.

In this demo we want to predict which transactions might be suspicious activity. To train such a model we need to construct a training dataset containing rows for each user, one column indicating whether a transaction was suspicious or not, and X additional columns with features about the user, such as monthly transaction aggregates and node embeddings. To generate such prediction targets, we have to go back in time and determine which transactions were suspicious in the last couple of months or years.

For training data we want to use feature signals that happened before the prediction target event (green signals); signals that happened after it (light red signals) must be strictly avoided. This prevents information from the future from leaking into the training dataset.

![image.png](images/pit-join.png)

To achieve this, we want the feature store to remember the prediction target event timestamp at the user level (row level) and find the latest values of our prediction features before this point in time. The diagram above illustrates this approach for one user. This challenge is solved by point-in-time correct joins.

**Point-in-time joins** prevent feature leakage by recreating the state of the world at a single point in time for every entity or primary key value (user in our case).
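To make the idea concrete, here is a minimal sketch of a point-in-time join in plain Python. It is not how Hopsworks implements the join; the toy rows, the user ids, and the function `point_in_time_join` are hypothetical. The sketch also assumes that a feature observed at exactly the label's event time may be used; use a strict "before" comparison if that would leak information in your setting.

```python
from bisect import bisect_right

# Hypothetical feature history: (user id, event time, feature value).
feature_rows = [
    ("u1", 1, 100.0),
    ("u1", 3, 250.0),
    ("u1", 6, 900.0),
    ("u2", 2, 40.0),
]

# Hypothetical prediction targets: (user id, event time of the target, label).
label_rows = [
    ("u1", 4, 1),
    ("u2", 1, 0),
    ("u2", 5, 0),
]


def point_in_time_join(labels, features):
    """Attach to each label the latest feature value seen at or before its event time."""
    # Index the feature history per user, sorted by event time.
    history = {}
    for user, ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        times, values = history.setdefault(user, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for user, label_ts, label in labels:
        times, values = history.get(user, ([], []))
        # Count the feature events whose event time is <= label_ts.
        pos = bisect_right(times, label_ts)
        # No earlier feature event means there is nothing we are allowed to use.
        feature = values[pos - 1] if pos > 0 else None
        joined.append((user, label_ts, label, feature))
    return joined


training_rows = point_in_time_join(label_rows, feature_rows)
for row in training_rows:
    print(row)
```

For user `u1` the label at time 4 picks up the feature value from time 3 and ignores the later value from time 6, which is exactly the leakage the join is meant to prevent. The label for `u2` at time 1 gets no feature value because none existed yet.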

**Hopsworks Feature Store** abstracts this complexity away: you simply tell it where to find the relevant event timestamps for feature groups. We will go through the process in the rest of the notebook.

### Event-time enabled Feature Groups

For the Feature Store to be able to perform a PIT join, we need to tell it where to find the event timestamp within each feature group. Event time is the instant at which an event happened at its source, so it is *not* an ingestion timestamp or the like; it should originate from your source systems.

To "event-time enable" a feature group, you set the `event_time` argument at feature group creation. We use plain integers to represent the timestamps, for better readability.
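As a rough illustration of an integer event time, the sketch below converts a source timestamp string into epoch milliseconds. The helper `to_epoch_millis`, the timestamp format, the UTC assumption, and the millisecond unit are all assumptions made for this example; the unit and conversion used for `tran_timestamp` in this notebook may differ.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(ts: str) -> int:
    """Convert a 'YYYY-MM-DD HH:MM:SS' string (assumed UTC) to epoch milliseconds."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


# Hypothetical record: the event time comes from the source system,
# while the ingestion time is only when the row reached the pipeline.
record = {
    "id": "a1",
    "tran_timestamp": to_epoch_millis("2020-01-01 00:00:00"),  # event time
    "ingested_at": to_epoch_millis("2020-01-03 09:30:00"),     # not the event time
}
print(record)
```

Only the column holding the source event time should be passed as `event_time`; an ingestion timestamp would shift every feature value later in time and distort the point-in-time join.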


#### Transactions monthly aggregates feature group

In [27]:
transactions_fg = fs.get_or_create_feature_group(
    name = "transactions_monthly_fg",
    version = 1,
    primary_key = ["id"],
    partition_key = ["tran_timestamp"],   
    description = "transactions monthly aggregates features",
    event_time = ['tran_timestamp'],
    time_travel_format = "HUDI",  
    online_enabled = True,
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False},
    #expectation_suite=expectation_suite
)   

transactions_fg.save(in_out_df)

Feature Group created successfully, explore it at 
https://hopsworks0.logicalclocks.com/p/119/fs/67/fg/13
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://hopsworks0.logicalclocks.com/p/119/jobs/named/transactions_monthly_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fab8daef7f0>, None)

#### Alert Transaction labels feature group

In [28]:
transaction_labels_fg = fs.get_or_create_feature_group(
    name = "transaction_labels_fg",
    version = 1,
    primary_key = ["tran_id"],
    partition_key = ["alert_type"],         
    description = "alert transactions",
    event_time = ['tran_timestamp'],    
    time_travel_format = "HUDI",     
    online_enabled = True,                                                
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False}
)

transaction_labels_fg.insert(transaction_labels)

Feature Group created successfully, explore it at 
https://hopsworks0.logicalclocks.com/p/119/fs/67/fg/14
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://hopsworks0.logicalclocks.com/p/119/jobs/named/transaction_labels_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fab8d835160>, None)

#### Party feature group

In [29]:
party_fg = fs.get_or_create_feature_group(
    name = "party_fg",
    version = 1,
    primary_key = ["id"],
    description = "party fg with labels",
    event_time = ['tran_timestamp'],        
    time_travel_format = "HUDI",
    online_enabled = True,
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False}
)

party_fg.insert(party_labels)

Feature Group created successfully, explore it at 
https://hopsworks0.logicalclocks.com/p/119/fs/67/fg/15
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://hopsworks0.logicalclocks.com/p/119/jobs/named/party_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fab8d83a5e0>, None)

#### Graph embeddings feature group

In [30]:
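from hsfs import engine

# Infer the feature schema from the DataFrame so that online types can be overridden
features = engine.get_instance().parse_schema_feature_group(graph_embdeddings_df)
for f in features:
    # Store array-valued embedding features as binary in the online feature store
    if f.type == "array<float>":
        f.online_type = "VARBINARY(200)"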

In [31]:
graph_embeddings_fg = fs.get_or_create_feature_group(name="graph_embeddings_fg",
                                       version=1,
                                       primary_key=["id"],
                                       description="node embeddings from transactions graph",
                                       event_time = ['tran_timestamp'],      
                                       time_travel_format="HUDI",     
                                       online_enabled=True,                                                
                                       statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False},
                                       features=features)

graph_embeddings_fg.insert(graph_embdeddings_df)

Feature Group created successfully, explore it at 
https://hopsworks0.logicalclocks.com/p/119/fs/67/fg/16
Launching offline feature group backfill job...
Backfill Job started successfully, you can follow the progress at 
https://hopsworks0.logicalclocks.com/p/119/jobs/named/graph_embeddings_fg_1_offline_fg_backfill/executions


(<hsfs.core.job.Job at 0x7fab8d843370>, None)

---