## Model Training

In this notebook, we'll train a model on the dataset we created in the previous tutorial. We'll use standard Python and scikit-learn, although the model could just as well be trained with other machine learning frameworks such as PySpark, TensorFlow, or PyTorch.

In [None]:
import hsfs

conn = hsfs.connection()
fs = conn.get_feature_store()

### Load Training Data

First, we'll need to fetch the training dataset that we created in the previous notebook.

In [None]:
td = fs.get_training_dataset("transactions_dataset_splitted")
X_train = td.read("train")
X_val = td.read("validation")

X_train.head()

Next, we'll one-hot encode the categorical feature `category`.

In [None]:
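import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the training split only, then reuse it for validation.
# The encoder returns a sparse matrix by default, so convert it with toarray().
enc = OneHotEncoder()
one_hot_train = enc.fit_transform(X_train[["category"]]).toarray()
one_hot_val = enc.transform(X_val[["category"]]).toarray()

# Use string column names and keep each split's index so concat aligns rows.
cols = [f"category_{c}" for c in enc.categories_[0]]
one_hot_train = pd.DataFrame(one_hot_train, columns=cols, index=X_train.index)
one_hot_val = pd.DataFrame(one_hot_val, columns=cols, index=X_val.index)
X_train = pd.concat([X_train.drop(columns="category"), one_hot_train], axis=1)
X_val = pd.concat([X_val.drop(columns="category"), one_hot_val], axis=1)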

We will train a model to predict `fraud_label` given the rest of the features.

In [None]:
target = td.label[0] # "fraud_label"

y_train = X_train.pop(target)
y_val = X_val.pop(target)

Let's check the distribution of our target label.

In [None]:
y_train.value_counts(normalize=True)

Notice that the distribution is extremely skewed, which is expected since fraudulent transactions make up only a tiny fraction of all transactions. We therefore need to address the class imbalance. There are several approaches, such as weighting the loss function, over- or undersampling, generating synthetic data, or adjusting the decision threshold. In this example we'll use the simplest one: supplying a class weight parameter to the learning algorithm. The class weight controls how much each class contributes to the loss, so in our case errors on positive (fraudulent) samples will count more than errors on negative ones.
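To get a feel for how class weights relate to the label distribution, here is a minimal sketch of the "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`. This is the formula scikit-learn applies for `class_weight="balanced"`. The helper `balanced_class_weights` and the toy labels are made up for illustration and are not part of this tutorial's dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels (hypothetical): 98 legitimate and 2 fraudulent transactions.
toy_labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer a class is, the larger its weight becomes. A dictionary of this shape can be passed as `class_weight` to a scikit-learn estimator, or you can pass `class_weight="balanced"` to have it computed for you.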

### Train Model

Next, we'll train a logistic regression model, setting the class weight of the positive class to twice that of the negative class.

In [None]:
from sklearn.linear_model import LogisticRegression

# Train model.
clf = LogisticRegression(class_weight={0: 1, 1: 2}, solver='liblinear')
clf.fit(X_train, y_train)
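As a rough illustration of what the class weight does during training, the sketch below scales each sample's log loss by the weight of its true class. It is a simplification: it leaves out the regularization term that the actual solver also optimizes, and the function `weighted_log_loss` and the toy values are hypothetical.

```python
import math


def weighted_log_loss(y_true, p_fraud, class_weight):
    """Sum of per-sample log losses, each scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_fraud):
        # Log loss of one sample: -log of the probability assigned to the true class.
        sample_loss = -math.log(p if y == 1 else 1.0 - p)
        total += class_weight[y] * sample_loss
    return total


# Toy values (hypothetical): one legitimate and one fraudulent sample,
# both predicted as fraud with probability 0.5.
y_toy = [0, 1]
p_toy = [0.5, 0.5]

unweighted = weighted_log_loss(y_toy, p_toy, {0: 1, 1: 1})
weighted = weighted_log_loss(y_toy, p_toy, {0: 1, 1: 2})
print(unweighted, weighted)
```

With the weights `{0: 1, 1: 2}`, the fraudulent sample contributes twice as much to the total loss as the legitimate one, so the optimizer is pushed harder to classify fraud correctly.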

Let's see how well it performs on our validation data.

In [None]:
from sklearn.metrics import classification_report

preds = clf.predict(X_val)

print(classification_report(y_val, preds))
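With a label this skewed, overall accuracy can look good even when most fraud is missed, so the precision and recall of the positive class are the figures to read in the report. The sketch below shows how those two numbers are derived from true and predicted labels. The helper `precision_recall` and the toy labels are made up for illustration and are not results from this model.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels (hypothetical): 4 fraudulent samples out of 10.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of flagged transactions that really are fraudulent, and recall is the share of fraudulent transactions that were flagged. Raising the positive class weight typically trades some precision for higher recall.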