## Fraud Tutorial - Feature Groups

In this series of tutorials, we will work with data related to credit card transactions. The end goal is to train and serve a model on the Hopsworks platform that can predict whether a credit card transaction is fraudulent or not.

In this particular tutorial you will learn how to:
- Connect to the Hopsworks feature store.
- Create feature groups and upload them to the feature store.

### Data

The data we will use comes from three different CSV files:

- `credit_cards.csv`: credit card information such as expiration date and provider.
- `transactions.csv`: transaction information such as timestamp, location, and the amount. Importantly, the binary `fraud_label` variable tells us whether a transaction was fraudulent or not.
- `profiles.csv`: credit card user information such as birthdate and city of residence.

We can conceptualize these CSV files as originating from separate data sources. All three files have a credit card number column `cc_num` in common, which we will use later for joins related to feature engineering and dataset creation.
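
As a minimal sketch of what such a join could look like, the snippet below merges two toy DataFrames on `cc_num`. The rows and the `amount` column values are made up for illustration; only the `cc_num`, `provider`, and `tid` column names come from this tutorial.

```python
import pandas as pd

# Toy rows standing in for the real files; the values are invented for illustration.
cards = pd.DataFrame({
    "cc_num": [1111, 2222],
    "provider": ["visa", "mastercard"],
})
transactions = pd.DataFrame({
    "tid": ["t1", "t2", "t3"],
    "cc_num": [1111, 1111, 2222],
    "amount": [12.50, 80.00, 5.25],
})

# A left join keeps every transaction and attaches the matching card's columns.
enriched = transactions.merge(cards, on="cc_num", how="left")
print(enriched)
```

A left join is used here so that no transaction is dropped if a card happens to be missing from the card table; other join types may suit other steps better.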

Let's go ahead and load the data.

In [7]:
import pandas as pd

credit_cards_df = pd.read_csv('https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/fraud_detection/data/credit_cards.csv')
credit_cards_df.head(3)

| Unnamed: 0 | cc_num | provider | expires |
| --- | --- | --- | --- |
| 0 | 4031433455074417 | visa | 02/25 |
| 1 | 4436104537406320 | visa | 08/24 |
| 2 | 4571305563689391 | visa | 02/21 |


In [9]:
profiles_df = pd.read_csv('https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/fraud_detection/data/profiles.csv', parse_dates=["birthdate"])
profiles_df.head(3)

| Unnamed: 0 | name | sex | mail | birthdate | City | Country | cc_num |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Teresa Smith | F | kevin70@yahoo.com | 1972-11-07 | Camarillo | US | 4031433455074417 |
| 1 | Luis Hays | M | kevinstewart@hotmail.com | 1995-07-28 | Troutdale | US | 4436104537406320 |
| 2 | Kyle Clark | M | davidflores@gmail.com | 1954-12-30 | Fort Washington | US | 4571305563689391 |


In [10]:
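# The ZIP holds two entries ('transactions.csv' and '__MACOSX/._transactions.csv'), so
# pd.read_csv on the URL raises "ValueError: Multiple files found in ZIP file."
# Download the archive and open the 'transactions.csv' member explicitly instead.
import io
import zipfile
from urllib.request import urlopen

trans_url = 'https://raw.githubusercontent.com/logicalclocks/hopsworks-tutorials/fraud_detection/data/transactions.csv.zip'
with urlopen(trans_url) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
with archive.open('transactions.csv') as csv_file:
    trans_df = pd.read_csv(csv_file, parse_dates=["datetime"])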
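
The `ValueError` quoted in this cell arises because pandas only reads a ZIP archive that contains exactly one file, while this one also carries a `__MACOSX` metadata entry. The sketch below reproduces that layout with a small in-memory archive and reads just the named member. The archive contents and the `toy_trans_df` name are illustrative, not the real data.

```python
import io
import zipfile

import pandas as pd

# Build a small in-memory archive with the same layout as in the error message.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("transactions.csv", "tid,amount\nt1,12.5\nt2,80.0\n")
    archive.writestr("__MACOSX/._transactions.csv", "metadata, not CSV")
buffer.seek(0)

# Open the archive ourselves and hand pandas only the member we want.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("transactions.csv") as member:
        toy_trans_df = pd.read_csv(member)

print(toy_trans_df)
```

Selecting the member by name avoids depending on the order of entries inside the archive.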

### Connect to the Feature Store

We start by connecting to our feature store.

In [11]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.


### Creating Feature Groups

A [feature group](https://docs.hopsworks.ai/feature-store-api/latest/generated/feature_group/) can be seen as a collection of conceptually related features that typically originate from the same data source. In our problem setting, we can imagine that the transaction data arrives as a real-time stream, whereas the profile and credit card data are updated far less often. It therefore makes sense to create one feature group per CSV file.

To create a feature group we need to give it a name and specify a primary key. The primary key is the set of features that are used to uniquely identify a row in the feature group, e.g. transaction ID `tid` for the transaction data.
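
Because the primary key is meant to identify each row uniquely, it can be worth checking that property on the DataFrame before creating the feature group. The helper below, `primary_key_problems`, is a hypothetical name introduced for this sketch and is not part of `hsfs`. The toy rows are made up.

```python
import pandas as pd


def primary_key_problems(df, primary_key):
    """Return the rows whose primary-key values are missing or duplicated."""
    missing = df[primary_key].isna().any(axis=1)
    duplicated = df.duplicated(subset=primary_key, keep=False)
    return df[missing | duplicated]


# Toy transactions in which the ID "t2" appears twice.
toy_trans = pd.DataFrame({
    "tid": ["t1", "t2", "t2"],
    "amount": [12.50, 80.00, 5.25],
})
problems = primary_key_problems(toy_trans, ["tid"])
print(problems)
```

An empty result suggests the chosen columns are safe to use as a primary key for that DataFrame.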

Let's start by creating a feature group for the credit card data.

In [12]:
credit_cards_fg = fs.create_feature_group(
    name="credit_cards",
    description="Credit card information.",
    primary_key=["cc_num"],
    online_enabled=True
)

By setting `online_enabled=True` we enable low-latency access to the data. A full list of arguments can be found in the [documentation](https://docs.hopsworks.ai/feature-store-api/latest/generated/api/feature_store_api/#create_feature_group).

At this point, we have only specified some basic metadata for the feature group. It does not yet store any data, nor does it have a schema. To make the feature group persistent, we populate it with its data by calling its `save` method.
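
Since no schema exists until data is saved, it can help to look at the column names and types of the DataFrame first. The sketch below assumes the schema ends up reflecting the DataFrame's columns and dtypes. `preview_schema` is a hypothetical helper written for this example, and the toy rows reuse values shown earlier in this notebook.

```python
import pandas as pd


def preview_schema(df):
    """Map each column name to the name of its pandas dtype."""
    return {column: str(dtype) for column, dtype in df.dtypes.items()}


toy_cards = pd.DataFrame({
    "cc_num": [4031433455074417, 4436104537406320],
    "provider": ["visa", "visa"],
    "expires": ["02/25", "08/24"],
})
print(preview_schema(toy_cards))
```

A column that shows up with an unexpected type, such as a date read as plain text, is easier to fix before the first save than afterwards.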

In [13]:
credit_cards_fg.save(credit_cards_df)

Configuring ingestion job...
Uploading Pandas dataframe...
Launching ingestion job...
Ingestion Job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/125/jobs/named/credit_cards_1_insert_fg_25042022201841/executions




<hsfs.core.job.Job at 0x7fe741bd9bb0>

We repeat the process for the transaction and user profile data.

In [14]:
# event_time names the column that holds each transaction's timestamp.
# It may not be strictly needed here; it matters if a later step does a
# point-in-time or chronological split.

trans_fg = fs.create_feature_group(
    name="transactions",
    description="Transaction data.",
    primary_key=['tid'],
    online_enabled=True,
    event_time="datetime"
)
trans_fg.save(trans_df)

profiles_fg = fs.create_feature_group(
    name="profiles",
    description="Credit card user information.",
    primary_key=["cc_num"],
    online_enabled=True
)
profiles_fg.save(profiles_df)

Configuring ingestion job...
Uploading Pandas dataframe...
Launching ingestion job...
Ingestion Job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/125/jobs/named/transactions_1_insert_fg_25042022202441/executions




Configuring ingestion job...
Uploading Pandas dataframe...
Launching ingestion job...
Ingestion Job started successfully, you can follow the progress at https://hopsworks.glassfish.service.consul:8182/p/125/jobs/named/profiles_1_insert_fg_25042022202725/executions




<hsfs.core.job.Job at 0x7fe740d0c4c0>

You should now be able to inspect the feature groups in the Hopsworks UI.

### Next Steps

In this notebook, we created feature groups containing raw features. In the next notebook, we will engineer additional features from them and store those in new feature groups.