# Interactive Data Preparation


Before training the model, clean the data and create meaningful features that are good predictors of the target variable (whether a transaction was fraudulent). The `interactive-data-prep.ipynb` notebook demonstrates how to build those features interactively. While this approach is simple, it is unsuitable for production environments that involve continuous data ingestion, large data volumes, or real-time requirements. In the next section, you will implement the same logic for production using a feature store.

The training set is built from three datasets: credit transactions, user events, and labels indicating whether fraud occurred. In this example, each dataset is prepared separately and the three are combined later for training.
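The notebook delegates the combining step to a helper whose internals are not shown here. As a rough illustration only, the sketch below shows one way such a combination could work: attaching the most recent earlier event of the same user to each transaction with `pd.merge_asof`. The toy values and the choice of an as-of join are assumptions, not the notebook's actual method.

```python
import pandas as pd

# Toy data: column names mirror the datasets, values are made up
transactions = pd.DataFrame({
    "source": ["C1", "C2", "C1"],
    "timestamp": pd.to_datetime(
        ["2025-01-04 10:00", "2025-01-04 10:30", "2025-01-04 12:00"]
    ),
    "amount": [26.92, 48.22, 17.56],
    "fraud": [0, 0, 1],
})
events = pd.DataFrame({
    "source": ["C1", "C1"],
    "timestamp": pd.to_datetime(["2025-01-04 09:00", "2025-01-04 11:00"]),
    "event": ["login", "password_change"],
})

# Assumption: an as-of join. For each transaction, take the latest event of
# the same user ("by") whose timestamp is not later than the transaction.
# Both inputs must be sorted by the "on" column.
combined = pd.merge_asof(
    transactions.sort_values("timestamp"),
    events.sort_values("timestamp"),
    on="timestamp",
    by="source",
)
print(combined)
```

In this toy data, user `C2` has no events, so its `event` value is missing after the join. A real pipeline would need to decide how to fill such gaps.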

## Preparing the Credit Transaction Dataset

The following transformations create more meaningful features, which can have a more significant impact on the prediction than the raw data:
    
- Extracting the date components (hour, day of week) from the timestamp.
- One-hot encoding for the age groups, transaction category, and the gender.
- Aggregating the amount (average, sum, count, and max over 2-, 12-, and 24-hour time windows).
- Aggregating the transactions per category (over 14-day time windows).
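To make the first two transformations concrete, here is a minimal sketch on a made-up two-row DataFrame. The column names mirror the transactions dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy transactions: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-01-04 11:21:26", "2025-01-05 23:05:00"]),
    "category": ["es_transportation", "es_food"],
    "gender": ["F", "M"],
    "amount": [26.92, 48.22],
})

# Date components: weekday runs from 0 (Monday) to 6 (Sunday)
toy["day_of_week"] = toy["timestamp"].dt.weekday
toy["hour"] = toy["timestamp"].dt.hour

# One-hot encoding replaces each listed column with one indicator column per value
toy_encoded = pd.get_dummies(toy, columns=["category", "gender"])
print(toy_encoded.columns.tolist())
```

The original `category` and `gender` columns disappear from the result, and columns such as `category_es_food` and `gender_F` take their place.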


### Building categorical features

In [1]:
import pandas as pd
from src.date_adjust import adjust_data_timespan
import mlrun

# Fetch the transactions dataset from the MLRun data samples
data_path = mlrun.get_sample_path("data/fraud-demo-mlrun-fs-docs/")
transactions_data = pd.read_csv(data_path + "data.csv", parse_dates=['timestamp'])

# Sort by source and keep only the first 10,000 rows
transactions_data = transactions_data.sort_values(by='source', axis=0)[:10000]

# Shift the sample timestamps so that they span the past 2 days
transactions_data = adjust_data_timespan(transactions_data, new_period='2d')

# Preview
transactions_data.head(3)

Unnamed: 0,step,age,gender,zipcodeOri,zipMerchant,category,amount,fraud,timestamp,source,target,device
274633,91,5,F,28007,28007,es_transportation,26.92,0,2025-01-04 11:21:26.441634000,C1022153336,M1823072687,33832bb8607545df97632a7ab02d69c4
286902,94,2,M,28007,28007,es_transportation,48.22,0,2025-01-04 11:21:44.735259913,C1006176917,M348934600,fadd829c49e74ffa86c8da3be75ada53
416998,131,3,M,28007,28007,es_transportation,17.56,0,2025-01-04 11:21:49.842429939,C1010936270,M348934600,58d0422a50bc40c89d2b4977b2f1beea


In [2]:
transactions_data.columns

Index(['step', 'age', 'gender', 'zipcodeOri', 'zipMerchant', 'category',
       'amount', 'fraud', 'timestamp', 'source', 'target', 'device'],
      dtype='object')

The next cells add the date components and one-hot encodings, then aggregate the transaction amounts over time windows and count the transactions per category. The result is a long list of derived features that may help the model make better predictions.
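The windowed aggregation is easier to follow on a small example. The sketch below uses made-up data and applies the same `groupby` plus `pd.Grouper` pattern as the notebook. Note that `pd.Grouper(freq=...)` assigns rows to fixed, clock-aligned bins rather than to sliding windows.

```python
import pandas as pd

# Toy transactions for two users (values are made up)
toy = pd.DataFrame({
    "source": ["C1", "C1", "C1", "C2"],
    "timestamp": pd.to_datetime([
        "2025-01-04 10:05", "2025-01-04 11:30",
        "2025-01-04 12:10", "2025-01-04 10:20",
    ]),
    "amount": [10.0, 30.0, 5.0, 7.0],
})

indexed = toy.set_index("timestamp")

# Fixed 2-hour bins per user: [10:00, 12:00), [12:00, 14:00), ...
grouped = indexed.groupby(["source", pd.Grouper(freq="2h")])["amount"]

# transform() returns one value per input row, in the original row order,
# so the result can be attached to the unindexed frame via .values
toy["amount_sum_2h"] = grouped.transform("sum").values
toy["amount_count_2h"] = grouped.transform("count").values
print(toy)
```

The first two `C1` rows fall into the same bin, so both receive a sum of 40.0 and a count of 2. The third `C1` row lands in the next bin and only sees its own amount.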

In [3]:
# Work on a copy so the raw transactions DataFrame stays unchanged
processed_transactions = transactions_data.copy()

# Generate day-of-week (0 = Monday) and hour columns from the timestamp
processed_transactions['day_of_week'] = processed_transactions['timestamp'].dt.weekday
processed_transactions['hour'] = processed_transactions['timestamp'].dt.hour

# Map the age group 'U' to '0'; all other values pass through unchanged
processed_transactions["age_mapped"] = processed_transactions["age"].map(
    lambda x: {'U': '0'}.get(x, x)
)

# Encode the category and gender columns using one-hot encoding
processed_transactions = pd.get_dummies(processed_transactions, columns=['category', 'gender'])
processed_transactions.head()

Unnamed: 0,step,age,zipcodeOri,zipMerchant,amount,fraud,timestamp,source,target,device,...,category_es_hyper,category_es_leisure,category_es_otherservices,category_es_sportsandtoys,category_es_tech,category_es_transportation,category_es_travel,category_es_wellnessandbeauty,gender_F,gender_M
274633,91,5,28007,28007,26.92,0,2025-01-04 11:21:26.441634000,C1022153336,M1823072687,33832bb8607545df97632a7ab02d69c4,...,False,False,False,False,False,True,False,False,True,False
286902,94,2,28007,28007,48.22,0,2025-01-04 11:21:44.735259913,C1006176917,M348934600,fadd829c49e74ffa86c8da3be75ada53,...,False,False,False,False,False,True,False,False,False,True
416998,131,3,28007,28007,17.56,0,2025-01-04 11:21:49.842429939,C1010936270,M348934600,58d0422a50bc40c89d2b4977b2f1beea,...,False,False,False,False,False,True,False,False,False,True
334543,108,4,28007,28007,4.5,0,2025-01-04 11:22:02.135181118,C1033736586,M1823072687,30b269ae55984e5584f1dd5f642ac1a3,...,False,False,False,False,False,True,False,False,True,False
210647,72,4,28007,28007,1.83,0,2025-01-04 11:22:36.024263001,C1019071188,M348934600,97bee3503a984f59aa6139b59f933c0b,...,False,False,False,False,False,True,False,False,False,True


In [4]:
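transactions_for_agg = processed_transactions.set_index('timestamp')

# Aggregate amount statistics (mean, sum, count, max) per user and time window.
# pd.Grouper(freq=...) assigns rows to fixed, clock-aligned bins of that length.
windows = ['2H', '12H', '24H']
operations = ['mean', 'sum', 'count', 'max']
for window in windows:
    # Build the per-user, per-window groups once and reuse them for every statistic
    grouped_amounts = transactions_for_agg.groupby(['source', pd.Grouper(freq=window)])['amount']
    for op in operations:
        processed_transactions[f'amount_{op}_{window}'] = grouped_amounts.transform(op).values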

In [5]:
# Count each user's transactions per category over 14-day windows
# (summing a one-hot column counts the rows where it is True)
main_categories = ["es_transportation", "es_health", "es_otherservices",
       "es_food", "es_hotelservices", "es_barsandrestaurants",
       "es_tech", "es_sportsandtoys", "es_wellnessandbeauty",
       "es_hyper", "es_fashion", "es_home", "es_contents",
       "es_travel", "es_leisure"]
for category in main_categories:
    category_flags = transactions_for_agg.groupby(['source', pd.Grouper(freq='14D')])[f'category_{category}']
    processed_transactions[f'{category}_sum_14D'] = category_flags.transform('sum').values

processed_transactions.set_index(['source'], inplace=True)
processed_transactions.head()

source,step,age,zipcodeOri,zipMerchant,amount,fraud,timestamp,target,device,day_of_week,...,es_barsandrestaurants_sum_14D,es_tech_sum_14D,es_sportsandtoys_sum_14D,es_wellnessandbeauty_sum_14D,es_hyper_sum_14D,es_fashion_sum_14D,es_home_sum_14D,es_contents_sum_14D,es_travel_sum_14D,es_leisure_sum_14D
C1022153336,91,5,28007,28007,26.92,0,2025-01-04 11:21:26.441634000,M1823072687,33832bb8607545df97632a7ab02d69c4,5,...,1,1,1,1,0,1,0,0,0,0
C1006176917,94,2,28007,28007,48.22,0,2025-01-04 11:21:44.735259913,M348934600,fadd829c49e74ffa86c8da3be75ada53,5,...,4,0,1,1,0,2,0,0,0,0
C1010936270,131,3,28007,28007,17.56,0,2025-01-04 11:21:49.842429939,M348934600,58d0422a50bc40c89d2b4977b2f1beea,5,...,4,0,0,6,6,0,0,0,0,0
C1033736586,108,4,28007,28007,4.5,0,2025-01-04 11:22:02.135181118,M1823072687,30b269ae55984e5584f1dd5f642ac1a3,5,...,3,2,0,1,3,0,2,0,1,0
C1019071188,72,4,28007,28007,1.83,0,2025-01-04 11:22:36.024263001,M348934600,97bee3503a984f59aa6139b59f933c0b,5,...,1,0,0,0,1,4,0,1,1,0


In [6]:
processed_transactions.dtypes

step                                       int64
age                                       object
zipcodeOri                                 int64
zipMerchant                                int64
amount                                   float64
fraud                                      int64
timestamp                         datetime64[ns]
target                                    object
device                                    object
day_of_week                                int32
hour                                       int32
age_mapped                                object
category_es_barsandrestaurants              bool
category_es_contents                        bool
category_es_fashion                         bool
category_es_food                            bool
category_es_health                          bool
category_es_home                            bool
category_es_hotelservices                   bool
category_es_hyper                           bool
category_es_leisure 

## Preparing the User Events (Activities) Dataset

The events dataset contains user activities such as logins, detail changes, and password changes, any of which can hint at a fraud attempt. The next part shows how to load the events dataset and create categorical features per event type.
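As a small illustration, the sketch below one-hot encodes a made-up events table and then counts the events of each type per user. The per-user count is a hypothetical follow-up step and is not part of the notebook.

```python
import pandas as pd

# Toy events: column names mirror the dataset, values are made up
toy_events = pd.DataFrame({
    "source": ["C1", "C2", "C1"],
    "event": ["login", "password_change", "details_change"],
    "timestamp": pd.to_datetime([
        "2025-01-04 11:21:28", "2025-01-04 11:21:29", "2025-01-04 11:21:31",
    ]),
})

# One indicator column per event type, indexed by user
encoded = pd.get_dummies(toy_events, columns=["event"]).set_index("source")
event_columns = [c for c in encoded.columns if c.startswith("event_")]

# Hypothetical follow-up (not in the notebook): events of each type per user
events_per_user = encoded.groupby(level="source")[event_columns].sum()
print(events_per_user)
```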

### Processing the events dataset

In [7]:
# Fetch the user_events dataset from the server
user_events_data = pd.read_csv(data_path + "events.csv", 
                               index_col=0, quotechar="\'", parse_dates=['timestamp'])

# Adjust to the last 2 days to see the latest aggregations in the online feature vectors
user_events_data = adjust_data_timespan(user_events_data, new_period='2d')

# Preview
user_events_data.head(3)

Unnamed: 0,source,event,timestamp
45553,C137986193,password_change,2025-01-04 11:21:28.437310000
24134,C1940951230,details_change,2025-01-04 11:21:29.485492091
64444,C247537602,login,2025-01-04 11:21:31.140275103


In [8]:
# Generate categorical features from the event type
processed_events = user_events_data
processed_events = pd.get_dummies(processed_events, columns=['event'])
processed_events.set_index(['source'], inplace=True)
processed_events.head()

source,timestamp,event_details_change,event_login,event_password_change
C137986193,2025-01-04 11:21:28.437310000,False,False,True
C1940951230,2025-01-04 11:21:29.485492091,True,False,False
C247537602,2025-01-04 11:21:31.140275103,False,True,False
C470079617,2025-01-04 11:21:32.430724428,False,False,True
C1142118359,2025-01-04 11:21:33.221016830,False,True,False


## Extracting Labels and Training a Model

The final step is to generate a target label column (the fraud yes/no indication) and train a basic model to evaluate your assumptions. The next part demonstrates how to create the labels dataset and use sklearn to train and evaluate a basic model.
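The training helper used later in this section lives in `src.train_sklearn`, and its internals are not shown here. The sketch below is therefore only an assumption about what a basic sklearn train-and-evaluate step can look like: a random forest tuned with a small randomized search over hypothetical hyperparameters, fitted on synthetic data rather than the fraud dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and the binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed search space; the real helper may use different settings
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [20, 50], "max_depth": [3, 5, None]},
    n_iter=4,
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test split
y_pred = search.best_estimator_.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)
```

Evaluating on a held-out split, as done here, keeps the reported metrics separate from the data used for the hyperparameter search.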

### Creating the labels DataFrame

In [9]:
def create_labels(df):
    labels = df[['fraud','timestamp']].copy()
    labels = labels.rename(columns={"fraud": "label"})
    labels['timestamp'] = labels['timestamp'].astype("datetime64[ns]")
    labels['label'] = labels['label'].astype(int)
    return labels

In [10]:
# Create the target label dataset (fraud indication)
labels_set = create_labels(processed_transactions)
labels_set.head()

source,label,timestamp
C1022153336,0,2025-01-04 11:21:26.441634000
C1006176917,0,2025-01-04 11:21:44.735259913
C1010936270,0,2025-01-04 11:21:49.842429939
C1033736586,0,2025-01-04 11:22:02.135181118
C1019071188,0,2025-01-04 11:22:36.024263001


### Training the model

In [11]:
# Train a model based on the transactions, events, and labels
from src.train_sklearn import train_and_val, prepare_data_to_train

X_train, X_test, y_train, y_test = prepare_data_to_train(processed_transactions, processed_events, labels_set)
rf_best = train_and_val(X_train, X_test, y_train, y_test)

# Display the returned model (the metrics are printed during training)
rf_best

Fitting 3 folds for each of 100 candidates, totalling 300 fits
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0


## Done!

You've completed the second part: interactive data preparation.
Proceed to [Part 3](03-ingest-with-feature-store.ipynb) to learn how to build data ingestion services with the Feature Store.