### 📝 Import libraries

In [1]:
import hashlib
import datetime
import pandas as pd
import numpy as np
from neo4j import GraphDatabase
from tqdm import tqdm

## <span style="color:#ff5f27;"> 💽 Loading the Data </span>

The data we will use comes from three different CSV files:

- `transactions.csv`: transaction information such as timestamp, location, and the amount. 
- `alert_transactions.csv`: Suspicious Activity Report (SAR) transactions.
- `party.csv`: User profile information.

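In a production system, these CSV files would originate from separate data sources or tables, and probably separate data pipelines. **The files share keys we can use for joins: customer ids (`src`/`dst` in the transactions file, `partyId` in the party file) and transaction ids (`tran_id` in the transactions and alert transactions files).**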

Let's go ahead and load the data.

#### ⛳️ Transactions dataset

In [2]:
transactions_df = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/aml/transactions.csv", parse_dates = ['tran_timestamp'])
transactions_df.head(5)

Unnamed: 0,tran_id,tx_type,base_amt,tran_timestamp,src,dst
0,496,TRANSFER-FanOut,858.77,2020-01-01 00:00:00+00:00,3aa9646b,1e46e726
1,1342,TRANSFER-Mutual,386.86,2020-01-01 00:00:00+00:00,49203bc3,a74d1101
2,1580,TRANSFER-FanOut,616.43,2020-01-02 00:00:00+00:00,616d4505,99af2455
3,2866,TRANSFER-FanOut,146.44,2020-01-02 00:00:00+00:00,39be1ea2,e7ec7bdb
4,3997,TRANSFER-Mutual,439.09,2020-01-03 00:00:00+00:00,e2e0d938,afc399a9


#### ⛳️ Alert Transactions dataset

In [3]:
alert_transactions = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/aml/alert_transactions.csv")
alert_transactions.head()

Unnamed: 0,alert_id,alert_type,is_sar,tran_id
0,47,gather_scatter,True,11873
1,47,gather_scatter,True,11874
2,47,gather_scatter,True,11875
3,47,gather_scatter,True,13151
4,47,gather_scatter,True,23148


#### ⛳️ Party dataset

In [4]:
party = pd.read_csv("https://repo.hops.works/master/hopsworks-tutorials/data/aml/party.csv")
party.head()

Unnamed: 0,partyId,partyType
0,5628bd6c,Organization
1,a1fcba39,Organization
2,f56c9501,Individual
3,9969afdd,Organization
4,b356eeae,Individual


## <span style="color:#ff5f27;"> 🛠️ Feature Engineering </span>

#### To investigate patterns of suspicious activity, you will compute time-window aggregates such as the monthly frequency, total, mean, and standard deviation of the amounts of incoming and outgoing transactions.


In [5]:
# rename some columns
transactions_df = transactions_df.rename(columns={"src": "source",
                                                  "dst": "target"}, errors="raise")

# select the columns of interest
transactions_df = transactions_df[["source", "target", "tran_timestamp", "tran_id", "base_amt"]]
transactions_df.head()

Unnamed: 0,source,target,tran_timestamp,tran_id,base_amt
0,3aa9646b,1e46e726,2020-01-01 00:00:00+00:00,496,858.77
1,49203bc3,a74d1101,2020-01-01 00:00:00+00:00,1342,386.86
2,616d4505,99af2455,2020-01-02 00:00:00+00:00,1580,616.43
3,39be1ea2,e7ec7bdb,2020-01-02 00:00:00+00:00,2866,146.44
4,e2e0d938,afc399a9,2020-01-03 00:00:00+00:00,3997,439.09


##### Outgoing transactions

In [6]:
out_df = transactions_df.groupby([pd.Grouper(key='tran_timestamp', freq='M'), 'source'])\
                            .agg(monthly_count=('source','count'), 
                                 monthly_total_amount=('base_amt','sum'),
                                 monthly_mean_amount=('base_amt','mean'),
                                 monthly_std_amount=('base_amt','std')
                                )
out_df = out_df.reset_index(level=["source"])
out_df = out_df.reset_index(level=["tran_timestamp"])

# rename some columns
out_df = out_df.rename(columns={"source": "id",
                                                  "monthly_count": "monthly_out_count",
                                                  "monthly_total_amount": "monthly_out_total_amount",
                                                  "monthly_mean_amount": "monthly_out_mean_amount",
                                                  "monthly_std_amount": "monthly_out_std_amount"}, errors="raise")

out_df.tran_timestamp = out_df.tran_timestamp.values.astype(np.int64) // 10 ** 6
out_df.head(5)

Unnamed: 0,tran_timestamp,id,monthly_out_count,monthly_out_total_amount,monthly_out_mean_amount,monthly_out_std_amount
0,1580428800000,0016359b,4,1843.32,460.83,252.951744
1,1580428800000,0019b8d0,6,3074.78,512.463333,308.247279
2,1580428800000,00298665,1,521.11,521.11,
3,1580428800000,003e2533,3,1440.05,480.016667,251.265814
4,1580428800000,00498ec2,5,3414.95,682.99,58.726141
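
The empty `monthly_out_std_amount` for id `00298665` above is expected: pandas computes the sample standard deviation (`ddof=1`) by default, which is undefined for a single transaction. The following plain-Python sketch mirrors the monthly aggregation on a few made-up rows (the ids and amounts are illustrative, not taken from the dataset) to show where the missing value comes from.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev

# Toy transactions: (source id, date, amount). Values are illustrative only.
transactions = [
    ("a", date(2020, 1, 3), 100.0),
    ("a", date(2020, 1, 20), 300.0),
    ("b", date(2020, 1, 5), 50.0),
    ("a", date(2020, 2, 1), 10.0),
]

# Group amounts by (year, month, source), similar to a monthly Grouper plus 'source'.
groups = defaultdict(list)
for source, day, amount in transactions:
    groups[(day.year, day.month, source)].append(amount)

monthly = {}
for key, amounts in groups.items():
    monthly[key] = {
        "monthly_out_count": len(amounts),
        "monthly_out_total_amount": sum(amounts),
        "monthly_out_mean_amount": mean(amounts),
        # The sample standard deviation needs at least two values.
        "monthly_out_std_amount": stdev(amounts) if len(amounts) > 1 else None,
    }

for key in sorted(monthly):
    print(key, monthly[key])
```

Groups with a single transaction end up without a standard deviation, which is why the later `fillna(0)` step turns those entries into `0`.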


##### Incoming transactions

In [7]:
in_df = transactions_df.groupby([pd.Grouper(key='tran_timestamp', freq='M'), 'target'])\
                            .agg(monthly_count=('target','count'), 
                                 monthly_total_amount=('base_amt','sum'),
                                 monthly_mean_amount=('base_amt','mean'),
                                 monthly_std_amount=('base_amt','std'))

in_df = in_df.reset_index(level=["target"])
in_df = in_df.reset_index(level=["tran_timestamp"])
in_df.columns  = ["tran_timestamp", "id", "monthly_in_count", "monthly_in_total_amount", "monthly_in_mean_amount", "monthly_in_std_amount"]
in_df.tran_timestamp = in_df.tran_timestamp.values.astype(np.int64) // 10 ** 6
in_df.head(5)

Unnamed: 0,tran_timestamp,id,monthly_in_count,monthly_in_total_amount,monthly_in_mean_amount,monthly_in_std_amount
0,1580428800000,0016359b,4,1872.92,468.23,175.2747
1,1580428800000,001dcc27,9,5874.64,652.737778,271.236889
2,1580428800000,00298665,1,755.64,755.64,
3,1580428800000,003cd8f3,4,2678.3,669.575,179.51575
4,1580428800000,003e2533,4,3328.82,832.205,172.041666


##### Now let's join the incoming and outgoing transactions datasets

In [8]:
in_out_df = in_df.merge(out_df, on=['tran_timestamp', 'id'], how="outer")
in_out_df =  in_out_df.fillna(0)
in_out_df.head(5)

Unnamed: 0,tran_timestamp,id,monthly_in_count,monthly_in_total_amount,monthly_in_mean_amount,monthly_in_std_amount,monthly_out_count,monthly_out_total_amount,monthly_out_mean_amount,monthly_out_std_amount
0,1580428800000,0016359b,4.0,1872.92,468.23,175.2747,4.0,1843.32,460.83,252.951744
1,1580428800000,001dcc27,9.0,5874.64,652.737778,271.236889,0.0,0.0,0.0,0.0
2,1580428800000,00298665,1.0,755.64,755.64,0.0,1.0,521.11,521.11,0.0
3,1580428800000,003cd8f3,4.0,2678.3,669.575,179.51575,0.0,0.0,0.0,0.0
4,1580428800000,003e2533,4.0,3328.82,832.205,172.041666,3.0,1440.05,480.016667,251.265814


#### Assign labels to transactions that were identified as suspicious activity

In [9]:
alert_transactions.head(5)

Unnamed: 0,alert_id,alert_type,is_sar,tran_id
0,47,gather_scatter,True,11873
1,47,gather_scatter,True,11874
2,47,gather_scatter,True,11875
3,47,gather_scatter,True,13151
4,47,gather_scatter,True,23148


In [10]:
transaction_labels = transactions_df[["source","target","tran_id","tran_timestamp"]].merge(alert_transactions[["is_sar", "tran_id"]], on=["tran_id"], how="left")
transaction_labels.is_sar = transaction_labels.is_sar.map({True: 1, np.nan: 0})
transaction_labels.sort_values('tran_id',inplace = True)
transaction_labels.head(5)

Unnamed: 0,source,target,tran_id,tran_timestamp,is_sar
322886,cee9cf6d,79c248ae,2,2020-01-01 00:00:00+00:00,0
307052,65ab2f44,b20ce84b,3,2020-01-01 00:00:00+00:00,0
181198,2a39b731,a07edae4,4,2020-01-01 00:00:00+00:00,0
36864,528b9346,dc34c867,6,2020-01-01 00:00:00+00:00,0
351553,cc668310,b1d20498,7,2020-01-01 00:00:00+00:00,0


#### Now let's prepare the profile (party) dataset and assign labels indicating whether each party has been reported for suspicious activity

In [11]:
party.columns = ["id","type"]
party.type = party.type.map({"Individual": 0, "Organization": 1})

party.head(5)

Unnamed: 0,id,type
0,5628bd6c,1
1,a1fcba39,1
2,f56c9501,0
3,9969afdd,1
4,b356eeae,0


In [12]:
alert_transactions = transaction_labels[transaction_labels.is_sar ==1]
alert_transactions.head()

Unnamed: 0,source,target,tran_id,tran_timestamp,is_sar
41322,5e7442f1,0bffd1da,11873,2020-01-09 00:00:00+00:00,1
62128,65c7b5a1,0bffd1da,11874,2020-01-09 00:00:00+00:00,1
57575,04128f28,0bffd1da,11875,2020-01-09 00:00:00+00:00,1
85346,462568a4,0bffd1da,13151,2020-01-10 00:00:00+00:00,1
59894,0bffd1da,4a1c2abc,23148,2020-01-17 00:00:00+00:00,1


In [13]:
alert_transactions = transaction_labels[transaction_labels.is_sar ==1]

alert_sources = alert_transactions[["source", "tran_timestamp"]]
alert_sources.columns = ["id", "tran_timestamp"]
alert_sources.head()
alert_targets = alert_transactions[["target", "tran_timestamp"]]
alert_targets.columns = ["id", "tran_timestamp"]

sar_party = pd.concat([alert_sources, alert_targets], ignore_index=True)

sar_party.sort_values(["id", "tran_timestamp"], ascending = [False, True])

# find the first occurrence of a SAR per id
sar_party = sar_party.iloc[[sar_party.id.eq(id).idxmax() for id in sar_party['id'].value_counts().index]]
sar_party = sar_party.groupby([pd.Grouper(key='tran_timestamp', freq='M'), 'id']).agg(monthly_count=('id','count'))
sar_party = sar_party.reset_index(level=["id"])
sar_party = sar_party.reset_index(level=["tran_timestamp"])
sar_party.drop(["monthly_count"], axis=1, inplace=True)

sar_party["is_sar"] = 1
sar_party

Unnamed: 0,tran_timestamp,id,is_sar
0,2020-01-31 00:00:00+00:00,04128f28,1
1,2020-01-31 00:00:00+00:00,0bffd1da,1
2,2020-01-31 00:00:00+00:00,101c8a90,1
3,2020-01-31 00:00:00+00:00,462568a4,1
4,2020-01-31 00:00:00+00:00,4a1c2abc,1
...,...,...,...
811,2021-12-31 00:00:00+00:00,bc0ed20b,1
812,2021-12-31 00:00:00+00:00,e5cd7b59,1
813,2021-12-31 00:00:00+00:00,eac8b9c1,1
814,2021-12-31 00:00:00+00:00,f30f26a2,1
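
The first-occurrence step above scans the frame once per id. Assuming the intent is to keep each party's earliest alert timestamp, a sort followed by `drop_duplicates` is a possible alternative. This is a minimal sketch on a toy frame whose ids and dates are borrowed from the alert rows shown earlier; `first_sar` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the concatenated source/target alert rows.
sar_party = pd.DataFrame({
    "id": ["0bffd1da", "04128f28", "0bffd1da", "4a1c2abc"],
    "tran_timestamp": pd.to_datetime(
        ["2020-01-17", "2020-01-09", "2020-01-09", "2020-01-17"], utc=True
    ),
})

# Sort by time, then keep the first (earliest) row for each id.
first_sar = (
    sar_party.sort_values("tran_timestamp")
             .drop_duplicates(subset="id", keep="first")
             .sort_values("id")
             .reset_index(drop=True)
)
print(first_sar)
```

Each id appears once in `first_sar`, paired with its earliest timestamp, so the monthly grouping that follows would see one row per party.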


In [14]:
party_labels = party.merge(sar_party, on=["id"], how="left")
party_labels.is_sar = party_labels.is_sar.map({1.0: 1, np.nan: 0})
max_time_stamp = datetime.datetime.utcfromtimestamp(int(max(transaction_labels.tran_timestamp.values))/1e9)
party_labels = party_labels.fillna(max_time_stamp)

In [15]:
party_labels.head(5)

Unnamed: 0,id,type,tran_timestamp,is_sar
0,5628bd6c,1,2021-12-20 00:00:00,0
1,a1fcba39,1,2021-12-20 00:00:00,0
2,f56c9501,0,2021-12-20 00:00:00,0
3,9969afdd,1,2021-12-20 00:00:00,0
4,b356eeae,0,2021-12-20 00:00:00,0


# Neo4j

In [16]:
from graphdatascience import GraphDataScience
import math

def convertToNumber(s):
    return int.from_bytes(s.encode(), 'little')

def convertFromNumber(n):
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'little').decode()
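
As a quick sanity check, the two helpers can be exercised on one of the party ids shown earlier. The sketch below repeats the helper definitions so that it runs on its own; it also checks that the encoded value fits into a signed 64-bit integer, which holds for 8-character ASCII ids like the ones in this dataset but is an assumption for ids of any other shape.

```python
import math

def convertToNumber(s):
    # Interpret the UTF-8 bytes of the id as a little-endian integer.
    return int.from_bytes(s.encode(), 'little')

def convertFromNumber(n):
    # Reverse the encoding: integer -> bytes -> original string.
    return n.to_bytes(math.ceil(n.bit_length() / 8), 'little').decode()

party_id = "5628bd6c"
node_id = convertToNumber(party_id)
print(node_id)  # 7149011831509628469

# The round trip recovers the original id.
assert convertFromNumber(node_id) == party_id
# Eight ASCII characters always fit into a signed 64-bit integer.
assert node_id < 2**63
```

Longer ids would produce integers beyond the 64-bit range, so a different mapping (for example a lookup table) would be needed in that case.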

In [17]:
gds = GraphDataScience('bolt://localhost:7687', auth=('neo4j', 'hopsworks'))

In [18]:
def construct_graph(input_df):
    # Neo4j node formatting ##################
    
    # Extract unique nodes visited
    nodes = party_labels[party_labels.id.isin(input_df.source) | (party_labels.id.isin(input_df.target))]
    nodes = nodes[['id']]    
    nodes = nodes.rename(columns={"id": "nodeId"},
                         errors="raise")
    
    # Convert node ID to a positive integer (Neo4j requirements)
    nodes['nodeId'] = [convertToNumber(nodeId) for nodeId in nodes['nodeId']]
    nodes['nodeId'] = nodes['nodeId'].astype(int)

    # Neo4j relationships formatting ##################
    
    relationships = input_df[['source', 'target', 'tran_id']]
    relationships = relationships.rename(columns={"source": "sourceNodeId",
                                                  "target": "targetNodeId"},
                                         errors="raise")

    # Convert source node ID to a positive integer (Neo4j requirements)
    relationships['sourceNodeId'] = [convertToNumber(sourceNodeId) for sourceNodeId in relationships['sourceNodeId']]
    relationships['sourceNodeId'] = relationships['sourceNodeId'].astype(int)

    # Convert target node ID to a positive integer (Neo4j requirements)
    relationships['targetNodeId'] = [convertToNumber(targetNodeId) for targetNodeId in relationships['targetNodeId']]
    relationships['targetNodeId'] = relationships['targetNodeId'].astype(int)

    ##################
    
    # Build Graph
    G = gds.graph.construct("transactions-graph", nodes, relationships)

    # Check if the number of nodes is correctly stored
    assert G.node_count() == len(nodes)

    # Compute embeddings
    graph_embdeddings_df = gds.node2vec.stream(G)
    
    G.drop()

    # Convert integer node ID back to original ID
    graph_embdeddings_df['nodeId'] = [convertFromNumber(nodeId) for nodeId in graph_embdeddings_df['nodeId']]

    return {"id": graph_embdeddings_df.nodeId.to_numpy(), "graph_embeddings": graph_embdeddings_df.embedding.to_numpy()}

In [19]:
tqdm.pandas()
transaction_graphs_by_month = transaction_labels.groupby(pd.Grouper(key='tran_timestamp', freq='M')).progress_apply(lambda x: construct_graph(x))       

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:32<00:00,  1.34s/it]


In [20]:
timestamps = transaction_graphs_by_month.index.values
graph_embeddings = transaction_graphs_by_month.tolist()

In [21]:
graph_embdeddings_df = pd.DataFrame()
for timestamp, graph_embedding in zip(timestamps, graph_embeddings):
    df_tmp = pd.DataFrame(graph_embedding)
    df_tmp["tran_timestamp"] = timestamp
    graph_embdeddings_df = pd.concat([graph_embdeddings_df, df_tmp])    
graph_embdeddings_df.head(5)

Unnamed: 0,id,graph_embeddings,tran_timestamp
0,5628bd6c,"[-0.03805919736623764, 0.08318281173706055, 0....",2020-01-31
1,a1fcba39,"[-0.0054939440451562405, 0.010180824436247349,...",2020-01-31
2,f56c9501,"[-0.04437978193163872, 0.09381019324064255, 0....",2020-01-31
3,9969afdd,"[-0.0032248746138066053, 0.018540577962994576,...",2020-01-31
4,b356eeae,"[-0.010767131112515926, 0.026746975257992744, ...",2020-01-31


#### Convert datetimes to Unix epoch milliseconds

In [22]:
transaction_labels.tran_timestamp = transaction_labels.tran_timestamp.values.astype(np.int64) // 10 ** 6
graph_embdeddings_df.tran_timestamp = graph_embdeddings_df.tran_timestamp.values.astype(np.int64) // 10 ** 6
party_labels.tran_timestamp = party_labels.tran_timestamp.map(lambda x: datetime.datetime.timestamp(x) * 1000)
party_labels.tran_timestamp = party_labels.tran_timestamp.values.astype(np.int64)

transaction_labels['month'] = pd.to_datetime(transaction_labels['tran_timestamp'], unit='ms').dt.month
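
One detail worth keeping in mind for the `party_labels` conversion: `datetime.datetime.timestamp` interprets a naive datetime as local time, and the `max_time_stamp` fill value is naive. On a machine that is not set to UTC, those rows could therefore be shifted by the local UTC offset. The sketch below shows one way to make the conversion explicit; `to_epoch_ms` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime, timezone

def to_epoch_ms(dt):
    """Return milliseconds since the Unix epoch, treating naive datetimes as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# Month-end timestamp used by the monthly aggregates above.
month_end = datetime(2020, 1, 31, tzinfo=timezone.utc)
print(to_epoch_ms(month_end))  # 1580428800000

# A naive datetime gives the same result regardless of the local timezone.
print(to_epoch_ms(datetime(2020, 1, 31)))  # 1580428800000
```

The printed value matches the `tran_timestamp` shown in the aggregated tables earlier in the notebook.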

In [23]:
# Extract unique nodes
nodes = pd.DataFrame({'id': party_labels['id'].drop_duplicates()})

nodes = nodes.rename(columns={"id": "nodeId"},
                     errors="raise")

nodes['nodeId'] = [convertToNumber(nodeId) for nodeId in nodes['nodeId']]
nodes['nodeId'] = nodes['nodeId'].astype(int)

In [24]:
nodes.head(5)

Unnamed: 0,nodeId
0,7149011831509628469
1,4121745159176466785
2,3544391427334485350
3,7234019469221574969
4,7305231555947213666


In [None]:
len(nodes)

In [None]:
# Rename transactions into relationships
relationships = transaction_labels[['source', 'target', 'tran_id']]

relationships = relationships.rename(columns={"source": "sourceNodeId",
                                              "target": "targetNodeId"},
                                     errors="raise")

relationships['sourceNodeId'] = [convertToNumber(sourceNodeId) for sourceNodeId in relationships['sourceNodeId']]
relationships['sourceNodeId'] = relationships['sourceNodeId'].astype(int)

relationships['targetNodeId'] = [convertToNumber(targetNodeId) for targetNodeId in relationships['targetNodeId']]
relationships['targetNodeId'] = relationships['targetNodeId'].astype(int)

In [None]:
relationships.head(5)

In [None]:
len(relationships)

In [None]:
G = gds.graph.construct("transactions-graph", nodes, relationships)

In [None]:
gds.graph.list()

In [None]:
# Check if number of nodes is correctly stored
G.node_count() == len(nodes)

In [None]:
# When the graph is no longer needed, it should be dropped to free up memory
#G.drop()

In [None]:
graph_embdeddings_df = gds.node2vec.stream(G)

In [None]:
graph_embdeddings_df.head()

In [None]:
graph_embdeddings_df['nodeId'] = [convertFromNumber(nodeId) for nodeId in graph_embdeddings_df['nodeId']]

graph_embdeddings_df = graph_embdeddings_df.rename(columns={"nodeId": "id",
                                                           "embedding": "graph_embeddings"},
                                                   errors="raise")

In [None]:
graph_embdeddings_df.head()

#### Compute time evolving graph embeddings

In [None]:
transaction_graphs_by_month = transaction_labels.groupby(pd.Grouper(key='tran_timestamp', freq='M')).apply(lambda x: construct_graph(x))       

In [None]:
timestamps = transaction_graphs_by_month.index.values
graph_embeddings = transaction_graphs_by_month.tolist()

In [None]:
graph_embdeddings_df = pd.DataFrame()
for timestamp, graph_embedding in zip(timestamps, graph_embeddings):
    df_tmp = pd.DataFrame(graph_embedding)
    df_tmp["tran_timestamp"] = timestamp
    graph_embdeddings_df = pd.concat([graph_embdeddings_df, df_tmp])    
graph_embdeddings_df.head(5)

---

# 👮🏼‍♀️ Data Validation 

Before you define [feature groups](https://docs.hopsworks.ai/latest/generated/feature_group/), let's define [validation rules](https://docs.hopsworks.ai/latest/generated/feature_validation/) for the features. You expect some features to comply with certain *rules* or *expectations*. For example, a transacted amount must be a positive value. If a transacted amount arrives as a negative value, you can decide whether to block the `write` into the feature group and throw an error, or to allow the write and issue a warning. In the next section you will create feature store `expectations`, attach them to feature groups, and apply them to dataframes being appended to those feature groups.
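
To make the two outcomes concrete, here is a plain-Python sketch of the idea, independent of Hopsworks and Great Expectations. The helper `validate_amounts` and its `strict` flag are hypothetical names used only for illustration; they are not part of either library.

```python
import warnings

def validate_amounts(amounts, strict=True):
    """Hypothetical helper: check that every amount is positive before a write."""
    bad = [a for a in amounts if a <= 0]
    if not bad:
        return True
    message = f"{len(bad)} non-positive amount(s) found: {bad}"
    if strict:
        # Block the write by raising an error.
        raise ValueError(message)
    # Allow the write but tell the caller something looks wrong.
    warnings.warn(message)
    return False

print(validate_amounts([858.77, 386.86]))                # all amounts valid
print(validate_amounts([858.77, -1.0], strict=False))    # warns, returns False
```

With `strict=True` the second call would raise a `ValueError` instead of returning, which corresponds to rejecting the dataframe.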

#### Data validation with Great Expectations in Hopsworks
You can use the Great Expectations (GE) library for validation in the Hopsworks feature store.

##  <img src="../images/icon102.png" width="18px"></img> Hopsworks feature store

The Hopsworks feature store library is licensed under the Apache License 2.0 and available [here](https://github.com/logicalclocks/feature-store-api). The library is currently available for Python and for JVM languages such as Scala and Java.
In this notebook, we cover the Python part.

You can find the complete documentation of the library here: 

The first step is to establish a connection with your Hopsworks feature store instance and retrieve the object that represents the feature store you'll be working with. 

> By default, `project.get_feature_store()` returns the feature store of the project you are working with. It also accepts a name as a parameter to select a different feature store.

In [25]:
import hopsworks

project = hopsworks.login()

# Get the feature store handle for the project's feature store
fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Multiple projects found. 

	 (1) marco
	 (2) quickstart_shared



Enter project to access:  1



Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/397461
Connected. Call `.close()` to terminate connection gracefully.


### 🔬 Expectations suite

In [26]:
# Define Expectation Suite - no use of HSFS
import great_expectations as ge
from pprint import pprint
import json

expectation_suite = ge.core.ExpectationSuite(expectation_suite_name="aml_project_validations")
pprint(expectation_suite.to_json_dict(), indent=2)

{ 'data_asset_type': None,
  'expectation_suite_name': 'aml_project_validations',
  'expectations': [],
  'ge_cloud_id': None,
  'meta': {'great_expectations_version': '0.14.13'}}


In [27]:
expectation_suite.add_expectation(
  ge.core.ExpectationConfiguration(
  expectation_type="expect_column_max_to_be_between",
  kwargs={"column": "monthly_in_count", "min_value": 0, "max_value": 10000000}) 
)

{"meta": {}, "expectation_type": "expect_column_max_to_be_between", "kwargs": {"column": "monthly_in_count", "min_value": 0, "max_value": 10000000}}

In [28]:
pprint(expectation_suite)

{
  "meta": {
    "great_expectations_version": "0.14.13"
  },
  "expectations": [
    {
      "meta": {},
      "expectation_type": "expect_column_max_to_be_between",
      "kwargs": {
        "column": "monthly_in_count",
        "min_value": 0,
        "max_value": 10000000
      }
    }
  ],
  "ge_cloud_id": null,
  "expectation_suite_name": "aml_project_validations",
  "data_asset_type": null
}
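
The expectation added above constrains the maximum of `monthly_in_count`. As a rough mental model, the check can be pictured as the plain-Python function below. This is a simplified stand-in and not the Great Expectations implementation; `column_max_is_between` and the shape of its return value are assumptions made for illustration. The sample values are the first five `monthly_in_count` entries shown earlier.

```python
def column_max_is_between(values, min_value, max_value):
    """Simplified stand-in: does the column maximum lie within [min_value, max_value]?"""
    observed = max(values)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
    }

result = column_max_is_between([4, 9, 1, 4, 4], min_value=0, max_value=10000000)
print(result)
```

Because only the maximum is inspected, a single out-of-range value is enough to fail the check, while the remaining rows do not influence the outcome.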


---

## <span style="color:#ff5f27;"> 🪄 Register Feature Groups </span>

### Feature Groups

A `Feature Group` is a logical grouping of features. Experience has shown that this grouping generally originates from the features being derived from the same data source. The `Feature Group` lets you save metadata alongside features, which defines how the Feature Store interprets them, combines them, and reproduces training datasets created from them.

Generally, the features in a feature group are engineered together in an ingestion job. However, it is possible to have additional jobs to append features to an existing feature group. Furthermore, `feature groups` provide a way of defining a namespace for features, such that you can define features with the same name multiple times, but uniquely identified by the group they are contained in.

> It is important to note that `feature groups` are not groupings of features for immediate training of Machine Learning models. Instead, to ensure reusability of features, it is possible to combine features from any number of groups into training datasets.

#### Transactions monthly aggregates feature group

In [31]:
transactions_fg = fs.get_or_create_feature_group(
    name = "transactions_monthly_aml_fg",
    version = 1,
    primary_key = ["id"],
    partition_key = ["tran_timestamp"],   
    description = "transactions monthly aggregates features",
    event_time = 'tran_timestamp',
    online_enabled = True,
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False},
    expectation_suite=expectation_suite
)   

transactions_fg.insert(in_out_df)



Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/397461/fs/393284/fg/514422
2024-02-12 13:46:05,845 INFO: 	1 expectation(s) included in expectation_suite.
Validation succeeded.
Validation Report saved successfully, explore a summary at https://c.app.hopsworks.ai:443/p/397461/fs/393284/fg/514422


Uploading Dataframe: 0.00% |          | Rows 0/170876 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: transactions_monthly_aml_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/397461/jobs/named/transactions_monthly_aml_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x2efc8ba00>,
 {
   "meta": {
     "great_expectations_version": "0.14.13",
     "expectation_suite_name": "aml_project_validations",
     "run_id": {
       "run_name": null,
       "run_time": "2024-02-12T12:46:05.844927+00:00"
     },
     "batch_kwargs": {
       "ge_batch_id": "afed250c-c9a4-11ee-811f-a6e5112dcb0c"
     },
     "batch_markers": {},
     "batch_parameters": {},
     "validation_time": "20240212T124605.844776Z",
     "expectation_suite_meta": {
       "great_expectations_version": "0.14.13"
     }
   },
   "results": [
     {
       "expectation_config": {
         "meta": {
           "expectationId": 322575
         },
         "expectation_type": "expect_column_max_to_be_between",
         "kwargs": {
           "column": "monthly_in_count",
           "min_value": 0,
           "max_value": 10000000
         }
       },
       "meta": {
         "ingestionResult": "INGESTED",
         "validationTime": "2024-02-12T12:46:05.000844Z"
       

#### Alert Transaction labels feature group

In [37]:
transaction_labels_fg = fs.get_or_create_feature_group(
    name = "transaction_labels_aml_fg",
    version = 1,
    primary_key = ["tran_id"],
    partition_key = ["month"],         
    description = "alert transactions",
    event_time = 'tran_timestamp',    
    online_enabled = True,                                                
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False}
)

transaction_labels_fg.insert(transaction_labels)



Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/397461/fs/393284/fg/514424


Uploading Dataframe: 0.00% |          | Rows 0/438386 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: transaction_labels_aml_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/397461/jobs/named/transaction_labels_aml_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x2f11cd5d0>, None)

#### Party feature group

In [38]:
party_fg = fs.get_or_create_feature_group(
    name = "party_aml_fg",
    version = 1,
    primary_key = ["id"],
    description = "party fg with labels",
    event_time = 'tran_timestamp',        
    online_enabled = True,
    statistics_config = {"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": False}
)

party_fg.insert(party_labels)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/397461/fs/393284/fg/516438


Uploading Dataframe: 0.00% |          | Rows 0/7347 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: party_aml_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/397461/jobs/named/party_aml_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x2f120e7d0>, None)

#### Graph embeddings feature group

In [40]:
graph_embeddings_fg = fs.get_or_create_feature_group(name="graph_embeddings_aml_fg",
                                       version=1,
                                       primary_key=["id"],
                                       description="node embeddings from transactions graph",
                                       event_time = 'tran_timestamp',     
                                       online_enabled=True,                                                
                                       statistics_config={"enabled": False, "histograms": False, "correlations": False, "exact_uniqueness": False}
                                       )

graph_embeddings_fg.insert(graph_embdeddings_df)



Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/397461/fs/393284/fg/514425


Uploading Dataframe: 0.00% |          | Rows 0/170876 | Elapsed Time: 00:00 | Remaining Time: ?

Launching job: graph_embeddings_aml_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai/p/397461/jobs/named/graph_embeddings_aml_fg_1_offline_fg_materialization/executions


(<hsfs.core.job.Job at 0x2f127f310>, None)

---

## <span style="color:#ff5f27;"> 👓 Exploration </span>

### Feature groups are now accessible and searchable in the UI
![fg-overview](images/fg_explore.gif)

## 📊 Statistics
We can explore feature statistics in the feature groups. If statistics were not enabled when the feature group was created, they can be enabled afterwards:

```python
transactions_fg = fs.get_or_create_feature_group(
    name = "transactions_monthly_aml_fg",
    version = 1)

transactions_fg.statistics_config = {
    "enabled": True,
    "histograms": True,
    "correlations": True
}

transactions_fg.update_statistics_config()
transactions_fg.compute_statistics()
```

![fg-stats](images/freature_group_stats.gif)

## <span style="color:#ff5f27;"> ⏭️ **Next:** Part 02 </span>
    
In the following notebook you will use feature groups to create feature views and training datasets.