## Fraud Tutorial - Feature Engineering

In this notebook you will learn how to:
- Create derived feature groups from raw feature groups.

In [1]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


#### Load Feature Groups
We start by loading the feature groups we created in the previous notebook.

To load a feature group, we run `fs.get_feature_group(name, version)`, where `name` and `version` are the name and version of the feature group, respectively. By default, `version` is set to `1`, so the first version of the feature group is loaded.

In [4]:
# Load version 1 of each feature group created in the previous notebook.
credit_cards_fg = fs.get_feature_group("credit_cards", 1)
profiles_fg = fs.get_feature_group("profiles", 1)
trans_fg = fs.get_feature_group("transactions", 1)

# trans_fg.show(5)

### Feature Engineering

To train a fraud detection model, we first need to design a dataset that includes informative features from our feature groups. These can be both raw features, e.g. `amount` in the transactions feature group, and engineered features.

We will create two types of features:
1. Features that aggregate data from multiple feature groups. This could for instance be the age of a customer at the time of a transaction, which combines the `birthdate` feature from the profiles feature group with the `datetime` feature from the transactions feature group. These features will be stored in a feature group named `transactions_add_info`.
2. Features that aggregate data from multiple time steps. An example of this could be the transaction frequency of a credit card in the span of a few hours. These features will be stored in a feature group named `transactions_4h_aggs` (we will use 4-hour windows in this tutorial).

Both of these feature groups will have the transaction ID `tid` as primary key, which allows us to join them with the transactions feature group when creating the dataset.
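
Because both derived feature groups share the `tid` key with the transactions feature group, combining them amounts to a plain key join. The following is a minimal pandas sketch of that idea, not the feature store's own join API. The rows are made up for illustration; only the column names come from this tutorial.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the tutorial.
transactions = pd.DataFrame({"tid": ["a", "b", "c"], "amount": [10.0, 25.5, 3.2]})
add_info = pd.DataFrame({"tid": ["a", "b", "c"], "age_at_transaction": [31.2, 54.0, 19.9]})
aggs_4h = pd.DataFrame({"tid": ["a", "b", "c"], "trans_freq": [1.0, 2.0, 1.0]})

# Each table has exactly one row per `tid`, so the joins are one-to-one.
# `validate` makes pandas raise if that assumption is ever violated.
dataset = (
    transactions
    .merge(add_info, on="tid", validate="one_to_one")
    .merge(aggs_4h, on="tid", validate="one_to_one")
)
print(dataset)
```

Using `tid` as the key keeps the row count of the joined dataset equal to the number of transactions.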

#### `transactions_add_info`

We start by creating the `transactions_add_info` feature group, which for each transaction will have the following features:
- `age_at_transaction`: The age of the customer at the time of the transaction, in years.
- `days_until_card_expires`: Number of days from the transaction until the credit card expires.
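
To make the two definitions concrete, here is a small sketch that computes both values for a single hand-written row. It assumes that `expires` is a `MM/YY` string and that a year is the mean Gregorian year of 365.2425 days; both are assumptions of this sketch rather than requirements of the feature store.

```python
import pandas as pd

# One hand-written row; the column names follow the tutorial's feature groups.
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-02-14 16:32:52"]),
    "birthdate": pd.to_datetime(["1933-07-23"]),
    "expires": ["06/22"],
})

# Assumption: one year is the mean Gregorian year.
one_year = pd.Timedelta(days=365.2425)
age_at_transaction = (toy["datetime"] - toy["birthdate"]) / one_year

# "%m/%y" parses "06/22" as 2022-06-01, i.e. the first day of the expiry month.
expiry = pd.to_datetime(toy["expires"], format="%m/%y")
days_until_card_expires = (expiry - toy["datetime"]) / pd.Timedelta(days=1)

print(age_at_transaction.iloc[0], days_until_card_expires.iloc[0])
```

A negative `days_until_card_expires` means the transaction happened after the parsed expiry date.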

First, we make a query to get the raw features we need.

In [5]:
# Fetch the features we need.
query = profiles_fg.select(["birthdate"])\
    .join(trans_fg.select(["tid", "datetime"]), on=["cc_num"])\
    .join(credit_cards_fg.select(["expires"]), on=["cc_num"])

# Load the query results into a dataframe.
query_df = query.read()

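# TODO remove once the prefix bug has been resolved.
# Strip a leading "fg0."-style prefix. A regex is used because `str.lstrip`
# removes a set of characters and could also eat into the column name itself.
query_df.columns = query_df.columns.str.replace(r"^fg\d+\.", "", regex=True)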

query_df.head()

2022-04-25 20:29:22,818 INFO: USE `clean_up_featurestore`
2022-04-25 20:29:23,513 INFO: SELECT `fg2`.`birthdate`, `fg0`.`tid`, `fg0`.`datetime`, `fg1`.`expires`
FROM `clean_up_featurestore`.`profiles_1` `fg2`
INNER JOIN `clean_up_featurestore`.`transactions_1` `fg0` ON `fg2`.`cc_num` = `fg0`.`cc_num`
INNER JOIN `clean_up_featurestore`.`credit_cards_1` `fg1` ON `fg2`.`cc_num` = `fg1`.`cc_num`


| | birthdate | tid | datetime | expires |
|---|---|---|---|---|
| 0 | 1933-07-23 | d64878049fafec2835baa6bbcc521559 | 2022-02-14 16:32:52 | 06/22 |
| 1 | 1988-06-22 | ddcfa7ca42e680cbf832c2d5b00c0644 | 2022-01-19 10:54:42 | 01/26 |
| 2 | 1968-09-18 | 9ed26465cc8f2f3d9ee1a7bea9a02c00 | 2022-01-15 23:30:50 | 05/22 |
| 3 | 2002-03-09 | 157acb8811ee9577ef5b279149ea8642 | 2022-02-08 15:16:13 | 08/21 |
| 4 | 1997-02-27 | 087f8c8b7c3f3bcd0970abbd17741331 | 2022-01-13 14:51:29 | 06/25 |
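
A note on removing column prefixes: `str.lstrip` treats its argument as a set of characters, not as a pattern, so it can strip more than the prefix. The sketch below compares it with an anchored regular expression. The prefixed column names are hypothetical and assume prefixes of the form `fg0.`, `fg1.`, and so on.

```python
import pandas as pd

# Hypothetical prefixed column names, assuming a "fg<N>." prefix format.
cols = pd.Index(["fg0.tid", "fg0.fraud_label", "fg1.expires", "fg2.birthdate"])

# lstrip removes any leading characters from the set {f, g, (, 0, |, 1, 2, ), .},
# so the leading "f" of "fraud_label" is stripped as well.
stripped = cols.str.lstrip("fg(0|1|2).")

# A regex anchored at the start removes only the prefix itself.
cleaned = cols.str.replace(r"^fg\d+\.", "", regex=True)

print(list(stripped))
print(list(cleaned))
```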


We can now compute the features we are interested in.

In [6]:
import numpy as np
import pandas as pd

# Create dataframe for new feature group.
trans_add_info_df = pd.DataFrame()
trans_add_info_df["tid"] = query_df["tid"]

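# Age in years, using the mean Gregorian year (365.2425 days). Dividing by
# np.timedelta64(1, "Y") is not supported by recent pandas versions.
trans_add_info_df["age_at_transaction"] = (query_df["datetime"] - query_df["birthdate"]) / pd.Timedelta(days=365.2425)
# Days until expiry. "%m/%y" parses to the first day of the expiry month.
trans_add_info_df["days_until_card_expires"] = (pd.to_datetime(query_df["expires"], format="%m/%y") - query_df["datetime"]) / np.timedelta64(1, "D")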

trans_add_info_df.head()

| | tid | age_at_transaction | days_until_card_expires |
|---|---|---|---|
| 0 | d64878049fafec2835baa6bbcc521559 | 88.567704 | 106.310509 |
| 1 | ddcfa7ca42e680cbf832c2d5b00c0644 | 33.578936 | 1442.545347 |
| 2 | 9ed26465cc8f2f3d9ee1a7bea9a02c00 | 53.328897 | 105.020255 |
| 3 | 157acb8811ee9577ef5b279149ea8642 | 19.922753 | -191.636262 |
| 4 | 087f8c8b7c3f3bcd0970abbd17741331 | 24.878318 | 1234.380914 |


We save the feature group to our feature store in the same way as before.

In [7]:
trans_add_info_fg = fs.create_feature_group(
    name="transactions_add_info",
    description="Additional transaction information.",
    primary_key=["tid"],
    online_enabled=True
)
trans_add_info_fg.save(trans_add_info_df)

Configuring ingestion job...
Uploading Pandas dataframe...
Launching ingestion job...
Ingestion Job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/125/jobs/named/transactions_add_info_1_insert_fg_25042022203152/executions




<hsfs.core.job.Job at 0x7fa592bba580>

#### `transactions_4h_aggs`

Next, we create features that for each credit card aggregate data from multiple time steps.
For each transaction we add 5 features:
- `trans_volume_mavg`: Moving average of transaction volume in the last 4 hours.
- `trans_volume_mstd`: Moving standard deviation of transaction volume in the last 4 hours.
- `trans_freq`: Transaction frequency in the last 4 hours.
- `loc_delta`: Distance between the locations of two consecutive transactions of a credit card, expressed as a Haversine central angle in radians.
- `loc_delta_mavg`: Moving average of `loc_delta` in the last 4 hours.
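
The windowed features above can be illustrated on a tiny, made-up set of transactions. This sketch uses a `datetime` index with `groupby(...).rolling("4h")`, a slight variation on passing `on="datetime"`; the rows and values are hypothetical.

```python
import pandas as pd

# Hypothetical transactions for two cards.
toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 2],
    "datetime": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00",
        "2022-01-01 06:00", "2022-01-01 00:30",
    ]),
    "amount": [10.0, 30.0, 50.0, 7.0],
})

# Time-based windows need a sorted datetime index within each card.
toy = toy.set_index("datetime").sort_index()

# A "4h" window at time t covers the interval (t - 4h, t] for that card only.
roll = toy.groupby("cc_num")["amount"].rolling("4h")
mavg = roll.mean()
mstd = roll.std().fillna(0)  # std of a single observation is NaN, so fill with 0
freq = roll.count()

print(pd.DataFrame({"mavg": mavg, "mstd": mstd, "freq": freq}))
```

For card `1`, the 01:00 transaction shares its window with the 00:00 transaction, while the 06:00 transaction is alone in its window because the earlier ones are more than 4 hours old.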

We start by computing the first three features.

In [8]:
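# Load data from the transaction feature group.
trans_df = trans_fg.read()

# TODO remove once the prefix bug has been resolved.
# A regex is used because `str.lstrip` strips a set of characters, which would
# also turn `fraud_label` into `raud_label`.
trans_df.columns = trans_df.columns.str.replace(r"^fg\d+\.", "", regex=True)

# Sort by time (needed for the rolling windows below) and convert the
# coordinates from degrees to radians for the Haversine computation.
trans_df.sort_values("datetime", inplace=True)
trans_df[["longitude", "latitude"]] = np.radians(trans_df[["longitude", "latitude"]])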

2022-04-25 20:34:22,219 INFO: USE `clean_up_featurestore`
2022-04-25 20:34:22,908 INFO: SELECT `fg0`.`tid`, `fg0`.`datetime`, `fg0`.`cc_num`, `fg0`.`category`, `fg0`.`amount`, `fg0`.`latitude`, `fg0`.`longitude`, `fg0`.`city`, `fg0`.`country`, `fg0`.`fraud_label`
FROM `clean_up_featurestore`.`transactions_1` `fg0`


In [9]:
# Create dataframe for new feature group.
trans_4h_aggs_df = pd.DataFrame()
trans_4h_aggs_df["tid"] = trans_df["tid"]

window_len = "4h"
cc_group = trans_df.groupby("cc_num")

trans_4h_aggs_df['trans_volume_mavg'] = cc_group[["datetime", "amount"]]\
    .rolling(window_len, on="datetime")\
    .mean()\
    .reset_index(level=0, drop=True)\
    .drop(columns=["datetime"])

trans_4h_aggs_df['trans_volume_mstd'] = cc_group[["datetime", "amount"]]\
    .rolling(window_len, on="datetime")\
    .std()\
    .reset_index(level=0, drop=True)\
    .drop(columns=["datetime"])\
    .fillna(0)

trans_4h_aggs_df['trans_freq'] = cc_group[["datetime", "tid"]]\
    .rolling(window_len, on="datetime")\
    .count()\
    .reset_index(level=0, drop=True)\
    .drop(columns=["datetime"])

trans_4h_aggs_df.tail()

| | tid | trans_volume_mavg | trans_volume_mstd | trans_freq |
|---|---|---|---|---|
| 263950 | 8fcb8bb87f0edfe2464f49e0b3b49a31 | 63.97 | 0.0 | 1.0 |
| 406927 | c615c36e298e4ce00238484a93b581ae | 585.64 | 0.0 | 1.0 |
| 65483 | f0150f187cc462f7f649a4a8e13af28f | 48.72 | 0.0 | 1.0 |
| 299322 | bbd562425595518ec5d97ca41400aef9 | 1.42 | 0.0 | 1.0 |
| 224380 | 1a348f9ce0477c264edd70cff9a46beb | 11.82 | 0.0 | 1.0 |


Next, we compute the distance between consecutive transactions of each card, as well as its moving average.
We use the [Haversine distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.haversine_distances.html?highlight=haversine#sklearn.metrics.pairwise.haversine_distances) to quantify the distance between two points given by their latitude and longitude.
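
As a quick sanity check of the formula, the sketch below computes the Haversine central angle between two points and converts it to kilometres. The helper name `haversine_angle` and the mean Earth radius of 6371 km are assumptions of this sketch; the angle itself is unitless (radians).

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_angle(lat1, lon1, lat2, lon2):
    """Central angle in radians between two points given in radians."""
    a = np.sin((lat2 - lat1) / 2.0) ** 2
    b = np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return 2.0 * np.arcsin(np.sqrt(a + b))

# Two points on the equator, a quarter of the way around the globe apart.
angle = haversine_angle(0.0, 0.0, 0.0, np.radians(90.0))
distance_km = angle * EARTH_RADIUS_KM

print(angle, distance_km)
```

Multiplying by the radius is only needed when a physical distance is wanted; as a model feature, the raw angle carries the same information.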

In [10]:
def haversine(long, lat):
    """Compute Haversine distance between each consecutive coordinate in (long, lat)."""

    long_shifted = long.shift()
    lat_shifted = lat.shift()
    long_diff = long_shifted - long
    lat_diff = lat_shifted - lat

    a = np.sin(lat_diff/2.0)**2
    b = np.cos(lat) * np.cos(lat_shifted) * np.sin(long_diff/2.0)**2
    c = 2*np.arcsin(np.sqrt(a + b))

    return c

# Create temporary dataframe for computations.
loc_df = trans_df[["tid", "datetime", "cc_num"]].copy()

# Distance between each consecutive transaction of a card.
loc_df["loc_delta"] = cc_group.apply(lambda x: haversine(x["longitude"], x["latitude"]))\
    .reset_index(level=0, drop=True)\
    .fillna(0)

loc_df["loc_delta_mavg"] = loc_df.groupby("cc_num")[["datetime", "loc_delta"]]\
    .rolling(window_len, on="datetime")\
    .mean()\
    .reset_index(level=0, drop=True)\
    .drop(columns=["datetime"])

trans_4h_aggs_df["loc_delta"] = loc_df["loc_delta"]
trans_4h_aggs_df["loc_delta_mavg"] = loc_df["loc_delta_mavg"]

trans_4h_aggs_df.tail()

| | tid | trans_volume_mavg | trans_volume_mstd | trans_freq | loc_delta | loc_delta_mavg |
|---|---|---|---|---|---|---|
| 263950 | 8fcb8bb87f0edfe2464f49e0b3b49a31 | 63.97 | 0.0 | 1.0 | 0.00017 | 0.00017 |
| 406927 | c615c36e298e4ce00238484a93b581ae | 585.64 | 0.0 | 1.0 | 9.6e-05 | 9.6e-05 |
| 65483 | f0150f187cc462f7f649a4a8e13af28f | 48.72 | 0.0 | 1.0 | 9.8e-05 | 9.8e-05 |
| 299322 | bbd562425595518ec5d97ca41400aef9 | 1.42 | 0.0 | 1.0 | 3.4e-05 | 3.4e-05 |
| 224380 | 1a348f9ce0477c264edd70cff9a46beb | 11.82 | 0.0 | 1.0 | 8.9e-05 | 8.9e-05 |


In [11]:
trans_4h_aggs_fg = fs.create_feature_group(
    name=f"transactions_{window_len}_aggs",
    description=f"Aggregate transaction data over {window_len} windows.",
    primary_key=["tid"],
    online_enabled=True,
)
trans_4h_aggs_fg.save(trans_4h_aggs_df)

Configuring ingestion job...
Uploading Pandas dataframe...
Launching ingestion job...
Ingestion Job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/125/jobs/named/transactions_4h_aggs_1_insert_fg_25042022203515/executions




<hsfs.core.job.Job at 0x7fa58e680430>

### Next Steps

Now we have all the features we need for model training. In the next notebook, we will combine these features into a dataset that is compatible with the model we will train.