# <span style="font-width:bold; font-size: 3rem; color:#1EB182;"><img src="../../images/icon102.png" width="38px"></img> **Hopsworks Feature Store** </span><span style="font-width:bold; font-size: 3rem; color:#333;">- Part 01: Load, Engineer & Connect</span>

<span style="font-width:bold; font-size: 1.4rem;"> This is the first part of the AML tutorial. In this first module, you will work with data related to credit card transactions. 
The objective of this tutorial is to demonstrate how to work with the **Hopsworks Feature Store**, with the goal of training and deploying a model that can predict fraudulent transactions.</span>

## **🗒️ This notebook is divided into the following sections:** 
1. **Data Loading**: Load the data. 
2. **Feature Engineering**.
3. **Hopsworks Feature Store Connection**.
4. **Feature Groups Creation**: Create feature groups and upload them to the feature store.
5. **Explore feature groups from the UI**.

![tutorial-flow](../../images/01_featuregroups.png)

First of all we will load the data and do some feature engineering on it.

In [None]:
!pip install -U 'hopsworks[python, great_expectations]' --quiet

In [None]:
!pip install -r requirements.txt --quiet

## <span style="color:#ff5f27;"> 📝 Imports </span>

In [None]:
import hashlib
import datetime
import pandas as pd
import numpy as np

from pprint import pprint

from features.transactions import get_in_out_transactions
from features.party import get_transaction_labels, get_party_labels
from features.graph_embeddings import construct_graph

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

The data you will use comes from three different CSV files:

- `transactions.csv`: Transaction information such as timestamp, location, and the amount. 
- `alert_transactions.csv`: Suspicious Activity Report (SAR) transactions.
- `party.csv`: User profile information.

In a production system, these CSV files would originate from separate data sources or tables, and probably separate data pipelines. **All three files have a customer id column `id` in common, which we can use for joins.**

Let's go ahead and load the data.

### <span style="color:#ff5f27;"> ⛳️ Transactions dataset </span>

In [None]:
transactions_df = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/aml/transactions.csv", 
    parse_dates = ['tran_timestamp'],
)
transactions_df.head(3)

### <span style="color:#ff5f27;"> ⛳️ Alert Transactions dataset </span>

In [None]:
alert_transactions = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/aml/alert_transactions.csv",
)
alert_transactions.head(3)

### <span style="color:#ff5f27;"> ⛳️ Party dataset </span>

In [None]:
party = pd.read_csv(
    "https://repo.hops.works/master/hopsworks-tutorials/data/aml/party.csv",
)
party.head(3)

## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

To investigate patterns of suspicious activity, you will compute time-window aggregates such as the monthly frequency, total, mean, and standard deviation of the amounts of incoming and outgoing transactions.
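
As a rough illustration of what such monthly aggregates look like, the sketch below computes them with plain pandas on a toy frame. It is not the implementation of `get_in_out_transactions` (which lives in `features/transactions.py`); the toy data, the helper `monthly_stats`, and most output column names are assumptions made for this example.

```python
import pandas as pd

# Toy transactions (made-up values); column names follow the renamed
# transactions DataFrame used in this notebook.
toy_tx = pd.DataFrame({
    "source": ["a", "a", "b", "a"],
    "target": ["b", "c", "a", "b"],
    "tran_timestamp": pd.to_datetime(
        ["2020-01-03", "2020-01-20", "2020-01-25", "2020-02-02"]
    ),
    "base_amt": [100.0, 50.0, 30.0, 70.0],
})


def monthly_stats(df, party_col, prefix):
    """Hypothetical helper: monthly count, total, mean and std of base_amt per party."""
    month = df["tran_timestamp"].dt.to_period("M").rename("month")
    stats = (
        df.groupby([df[party_col].rename("id"), month])["base_amt"]
        .agg(["count", "sum", "mean", "std"])
    )
    stats.columns = [f"{prefix}_{name}" for name in stats.columns]
    return stats


# Outgoing aggregates are keyed by the sender, incoming ones by the receiver.
outgoing = monthly_stats(toy_tx, "source", "monthly_out")
incoming = monthly_stats(toy_tx, "target", "monthly_in")

# The outer join keeps parties that only send or only receive in a given month.
toy_in_out = outgoing.join(incoming, how="outer").reset_index()
print(toy_in_out)
```

A standard deviation over a single transaction is undefined, so months with only one transaction show `NaN` in the `*_std` columns.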


In [None]:
# Renaming columns for clarity
transactions_df.columns = ['tran_id', 'tx_type', 'base_amt', 'tran_timestamp', 'source', 'target']

# Reordering columns for better readability
transactions_df = transactions_df[["source", "target", "tran_timestamp", "tran_id", "base_amt"]]

# Displaying the first few rows of the DataFrame
transactions_df.head(3)

### <span style="color:#ff5f27;">⛳️ Incoming and Outgoing transactions </span>

In [None]:
# Generating a DataFrame with monthly incoming and outgoing transaction statistics
in_out_df = get_in_out_transactions(transactions_df)

# Displaying the first few rows of the resulting DataFrame
in_out_df.head(3)

### <span style="color:#ff5f27;"> ⛳️ Transactions identified as suspicious activity </span>

Assign labels to the transactions that were identified as suspicious activity.
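
Conceptually, labelling amounts to flagging every transaction whose id appears in the alert file. The sketch below shows that idea on toy data. It is not the code of `get_transaction_labels`, and the assumption that both frames share a `tran_id` column is made for illustration only.

```python
import pandas as pd

# Toy data (made-up values)
toy_transactions = pd.DataFrame(
    {"tran_id": [1, 2, 3, 4], "base_amt": [10.0, 250.0, 40.0, 990.0]}
)
toy_alerts = pd.DataFrame({"tran_id": [2, 4]})

# is_sar is 1 when the transaction id appears among the alerts, otherwise 0
toy_labels = toy_transactions.assign(
    is_sar=toy_transactions["tran_id"].isin(toy_alerts["tran_id"]).astype(int)
)
print(toy_labels)
```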

In [None]:
# Displaying the first few rows of the 'alert_transactions' DataFrame
alert_transactions.head(3)

In [None]:
# Generating transaction labels based on transaction and alert transaction data
transaction_labels = get_transaction_labels(
    transactions_df, 
    alert_transactions,
)

# Displaying the first three rows of the resulting DataFrame
transaction_labels.head(3)

### <span style="color:#ff5f27;"> ⛳️ Party dataset </span>

Now let's prepare the profile (party) dataset and assign labels indicating whether each party has been reported for suspicious activity.

In [None]:
# Renaming columns for clarity
party.columns = ["id", "type"]

# Mapping 'type' values to numerical values for better representation
party.type = party.type.map({"Individual": 0, "Organization": 1})

# Displaying the first three rows of the DataFrame
party.head(3)

In [None]:
# Keeping only the transactions labelled as SAR (Suspicious Activity Report) in the transaction labels DataFrame
alert_transactions = transaction_labels[transaction_labels.is_sar == 1]

# Displaying the first few rows of transactions flagged as SAR
alert_transactions.head(3)

In [None]:
# Generating party labels based on transaction labels and party information
party_labels = get_party_labels(
    transaction_labels, 
    party,
)

# Displaying the first three rows of the resulting DataFrame
party_labels.head(3)

## <span style="color:#ff5f27;">🧬 Graph representational learning using Graph Neural Network</span>

Financial transactions can be represented as a dynamic network graph. Graph representation techniques 
make it possible to describe each transaction within its broader context. In this example you will perform node 
representation learning. 

The network architecture of the graph convolution layer used for node representation learning was taken from 
[this Keras example](https://keras.io/examples/graph/gnn_citations/). It performs the following steps:

1. **Prepare**: The input node representations are processed using a FFN to produce a *message*. You can simplify
the processing by only applying linear transformation to the representations.
2. **Aggregate**: The messages of the neighbours of each node are aggregated with
respect to the `edge_weights` using a *permutation invariant* pooling operation, such as *sum*, *mean*, and *max*,
to prepare a single aggregated message for each node. See, for example, [tf.math.unsorted_segment_sum](https://www.tensorflow.org/api_docs/python/tf/math/unsorted_segment_sum)
APIs used to aggregate neighbour messages.
3. **Update**: The `node_repesentations` and `aggregated_messages`—both of shape `[num_nodes, representation_dim]`—
are combined and processed to produce the new state of the node representations (node embeddings).
If `combination_type` is `gru`, the `node_repesentations` and `aggregated_messages` are stacked to create a sequence,
then processed by a GRU layer. Otherwise, the `node_repesentations` and `aggregated_messages` are added
or concatenated, then processed using a FFN.
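
To make the three steps concrete, here is a minimal NumPy sketch of a single message-passing pass on a tiny graph. It makes simplifying assumptions that are not taken from the Keras example: the *prepare* step is a single linear map with a hypothetical weight matrix `w_prepare`, aggregation is a weighted sum, and the *update* step concatenates and applies one more linear map (`w_update`) followed by ReLU instead of a full FFN or GRU. The weights are random, so nothing is learned here.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 4, 3

# Current node representations, shape [num_nodes, representation_dim]
node_repesentations = rng.normal(size=(num_nodes, dim))

# Directed edges: node_indices[i] receives a message from neighbour_indices[i]
node_indices = np.array([0, 0, 1, 2, 3])
neighbour_indices = np.array([1, 2, 0, 3, 2])
edge_weights = np.array([1.0, 0.5, 1.0, 2.0, 1.0])

# Hypothetical random weights standing in for trained layers
w_prepare = rng.normal(size=(dim, dim))
w_update = rng.normal(size=(2 * dim, dim))

# 1. Prepare: turn each neighbour's representation into a weighted message
messages = node_repesentations[neighbour_indices] @ w_prepare
messages = messages * edge_weights[:, None]

# 2. Aggregate: sum the messages per receiving node (order of edges is irrelevant)
aggregated_messages = np.zeros((num_nodes, dim))
np.add.at(aggregated_messages, node_indices, messages)

# 3. Update: concatenate old state and aggregated messages, then transform
combined = np.concatenate([node_repesentations, aggregated_messages], axis=1)
node_embeddings = np.maximum(combined @ w_update, 0.0)

print(node_embeddings.shape)
```

`np.add.at` plays the role that `tf.math.unsorted_segment_sum` plays in TensorFlow: it accumulates rows that share the same target index.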


### <span style="color:#ff5f27;">🔮 Compute time evolving graph embeddings</span>

In [None]:
# Grouping transaction labels by month using pandas Grouper
transaction_graphs_by_month = transaction_labels.groupby(
    pd.Grouper(key='tran_timestamp', freq='M')
).apply(lambda x: construct_graph(x, party_labels))

# The resulting variable 'transaction_graphs_by_month' is a pandas Series indexed by month end:
# each entry holds the output of the 'construct_graph' function for that month,
# including the node embeddings ('graph_embeddings').
# The embeddings capture the graph structure of transactions during that month.

In [None]:
# Extracting timestamps and graph embeddings
timestamps = transaction_graphs_by_month.index.values
graph_embeddings = transaction_graphs_by_month.tolist()

In [None]:
# Collecting one DataFrame per month
monthly_frames = []

# Iterating through timestamps and corresponding graph embeddings
for timestamp, graph_embedding in zip(timestamps, graph_embeddings):
    # Creating a temporary DataFrame for each month's graph embeddings
    df_tmp = pd.DataFrame(graph_embedding)
    
    # Adding a 'tran_timestamp' column to store the timestamp for each row
    df_tmp["tran_timestamp"] = timestamp
    
    monthly_frames.append(df_tmp)

# Concatenating all monthly DataFrames once, instead of growing a DataFrame inside the loop
graph_embeddings_df = pd.concat(monthly_frames)

# Displaying the first three rows of the resulting DataFrame
graph_embeddings_df.head(3)

In [None]:
# Converting 'tran_timestamp' values to milliseconds for consistency
transaction_labels.tran_timestamp = transaction_labels.tran_timestamp.values.astype(np.int64) // 10 ** 6
graph_embeddings_df.tran_timestamp = graph_embeddings_df.tran_timestamp.values.astype(np.int64) // 10 ** 6

# Converting 'tran_timestamp' values in 'party_labels' to milliseconds
party_labels.tran_timestamp = party_labels.tran_timestamp.map(lambda x: datetime.datetime.timestamp(x) * 1000)
party_labels.tran_timestamp = party_labels.tran_timestamp.values.astype(np.int64)

## <span style="color:#ff5f27;">👮🏻‍♂️ Data Validation</span>

Before you define [feature groups](https://docs.hopsworks.ai/latest/generated/feature_group/), let's define [validation rules](https://docs.hopsworks.ai/latest/generated/feature_validation/) for the features. You expect some features to comply with certain *rules* or *expectations*. For example, a transacted amount must be a positive value. If a transacted amount arrives as a negative value, you can decide whether to block the `write` into the feature group and raise an error, or to allow the write but issue a warning. In the next section you will create feature store `expectations`, attach them to feature groups, and apply them to dataframes being appended to those feature groups.
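
The difference between the two behaviours can be sketched without any feature store at all. The helper below is hypothetical (it is not part of Hopsworks or Great Expectations), and its `strict` flag is only an illustrative stand-in for whichever ingestion policy you choose: it either raises on a failed rule or warns and lets the data through.

```python
import warnings

import pandas as pd


def validate_positive_amounts(df, column="base_amt", strict=True):
    """Hypothetical helper: reject, or merely flag, non-positive amounts."""
    n_bad = int((df[column] <= 0).sum())
    if n_bad == 0:
        return df
    message = f"{n_bad} row(s) in '{column}' are not positive"
    if strict:
        # Block the write entirely
        raise ValueError(message)
    # Let the data through, but make the problem visible
    warnings.warn(message)
    return df


toy_batch = pd.DataFrame({"base_amt": [120.0, -5.0, 40.0]})

# Lenient mode: the batch is accepted and a warning is recorded
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    accepted = validate_positive_amounts(toy_batch, strict=False)

# Strict mode: the same batch is rejected
try:
    validate_positive_amounts(toy_batch, strict=True)
    rejected = False
except ValueError:
    rejected = True

print(len(caught), rejected)
```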

#### Data validation with Great Expectations in Hopsworks
You can use the Great Expectations (GE) library for validation in the Hopsworks feature store. 

##  <img src="../../images/icon102.png" width="18px"></img> Hopsworks feature store

The Hopsworks feature store library is Apache V2 licensed and available [here](https://github.com/logicalclocks/feature-store-api). The library is currently available for Python and for JVM languages such as Scala and Java.
In this notebook, we are going to cover the Python part.

You can find the complete documentation of the library here: 

The first step is to establish a connection with your Hopsworks feature store instance and retrieve the object that represents the feature store you'll be working with. 

> By default, `project.get_feature_store()` returns the feature store of the project we are working with. However, it also accepts a project name as a parameter to select a different feature store.

In [None]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

## <span style="color:#ff5f27;">🔬 Expectations suite</span>


In [None]:
import great_expectations as ge

In [None]:
# Creating an Expectation Suite named "aml_project_validations"
expectation_suite = ge.core.ExpectationSuite(
    expectation_suite_name="aml_project_validations",
)

# Displaying the JSON representation of the Expectation Suite
pprint(expectation_suite.to_json_dict(), indent=2)

In [None]:
# Adding an expectation to the Expectation Suite
expectation_suite.add_expectation(
    ge.core.ExpectationConfiguration(
        expectation_type="expect_column_max_to_be_between",
        kwargs={
            "column": "monthly_in_count", 
            "min_value": 0, 
            "max_value": 10000000,
        }
    )
)

# Displaying the updated Expectation Suite
pprint(expectation_suite.to_json_dict(), indent=2)

---

## <span style="color:#ff5f27;"> 🪄 Feature Groups Creation</span>

### Feature Groups

A `Feature Group` is a logical grouping of features. Experience has shown that this grouping generally originates from the features being derived from the same data source. The `Feature Group` lets you save metadata alongside the features, which defines how the Feature Store interprets them, combines them, and reproduces training datasets created from them.

Generally, the features in a feature group are engineered together in an ingestion job. However, it is possible to have additional jobs to append features to an existing feature group. Furthermore, `feature groups` provide a way of defining a namespace for features, such that you can define features with the same name multiple times, but uniquely identified by the group they are contained in.

> It is important to note that `feature groups` are not groupings of features for immediate training of Machine Learning models. Instead, to ensure reusability of features, it is possible to combine features from any number of groups into training datasets.

### <span style="color:#ff5f27;">⛳️ Transactions monthly aggregates Feature Group</span>


In [None]:
# Get or create the 'transactions_monthly' feature group
transactions_fg = fs.get_or_create_feature_group(
    name="transactions_monthly",
    version=1,
    primary_key=["id"],
    partition_key=["tran_timestamp"],   
    description="Transactions monthly aggregates features",
    event_time="tran_timestamp",
    online_enabled=True,
    stream=True,
    statistics_config={
        "enabled": True, 
        "histograms": True, 
        "correlations": True, 
        "exact_uniqueness": False,
    },
    expectation_suite=expectation_suite,
)   
# Insert data into the feature group
transactions_fg.insert(in_out_df)
print('✅ Done!')

### <span style="color:#ff5f27;">⛳️ Party Feature Group</span>

In [None]:
# Get or create the 'party_labels' feature group
party_fg = fs.get_or_create_feature_group(
    name = "party_labels",
    version = 1,
    primary_key = ["id"],
    description = "Party fg with labels",
    event_time = "tran_timestamp",
    online_enabled = True,
    stream=True,
    statistics_config = {
        "enabled": True, 
        "histograms": True, 
        "correlations": True, 
        "exact_uniqueness": False,
    },
)
# Insert data into the feature group
party_fg.insert(party_labels)
print('✅ Done!')

### <span style="color:#ff5f27;">⛳️ Graph embeddings Feature Group</span>

In [None]:
from hsfs import embedding

# Create the Embedding Index
embedding_index = embedding.EmbeddingIndex()

embedding_length = graph_embeddings_df.graph_embeddings.iloc[0].shape[0]

embedding_index.add_embedding(
    "graph_embeddings",
    embedding_length,
)

In [None]:
# Get or create the 'graph_embeddings' feature group
graph_embeddings_fg = fs.get_or_create_feature_group(
    name="graph_embeddings",
    version=1,
    primary_key=["id"],
    description="Node embeddings from transactions graph",
    event_time="tran_timestamp",
    online_enabled=True,
    stream=True,
    statistics_config={
        "enabled": False,
        "histograms": False,
        "correlations": False, 
        "exact_uniqueness": False,
    },
    embedding_index=embedding_index,
)
# Insert data into the feature group
graph_embeddings_fg.insert(graph_embeddings_df)
print('✅ Done!')

---
## <span style="color:#ff5f27;"> 👓 Exploration </span>

### Feature groups are now accessible and searchable in the UI
![fg-overview](images/fg_explore.gif)

## 📊 Statistics
We can explore feature statistics in the feature groups. If statistics were not enabled when the feature group was created, you can enable them afterwards:

```python
# Retrieve the feature group created earlier in this notebook
transactions_fg = fs.get_feature_group(
    name = "transactions_monthly", 
    version = 1)

transactions_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

transactions_fg.update_statistics_config()
transactions_fg.compute_statistics()
```

![fg-stats](images/freature_group_stats.gif)

---
## <span style="color:#ff5f27;"> ⏭️ **Next:** Part 02 </span>
    
In the next notebook you will create a training dataset, train a model, and deploy it.