# Part 2: Training

In this part you learn how to use MLRun's **Feature Store** to easily define a **Feature Vector** and create the dataset you need to run the training process.  
By the end of this tutorial you’ll learn how to:
- Combine multiple data sources into a single feature vector
- Create a training dataset
- Create a model using an MLRun hub function

In [1]:
project_name = 'fraud-demo'

In [2]:
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)

> 2023-06-20 13:44:51,471 [info] loaded project fraud-demo from MLRun DB


## Step 1 - Create a feature vector  
In this section you create a feature vector.  
The feature vector has a `name`, so you can reference it later via its URI or from your serving function, and a list of 
`features` drawn from the available feature sets. Add a single feature with `<FeatureSet>.<Feature>`, 
or add `<FeatureSet>.*` to include all of a feature set's available features.  
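
As a rough illustration of the two reference forms, the following sketch expands `<FeatureSet>.<Feature>` and `<FeatureSet>.*` strings against a small in-memory catalog. The `CATALOG` dictionary and the `expand_features` helper are hypothetical and are not part of MLRun; the feature store resolves these references from its own metadata.

```python
# Hypothetical catalog: feature set name -> the features it exposes.
# Illustrative only; a real feature store resolves these from its metadata.
CATALOG = {
    "events": ["event_password_change", "event_details_change", "event_login"],
    "transactions": ["amount", "amount_max_2h", "gender_F", "gender_M"],
}


def expand_features(references, catalog):
    """Resolve '<FeatureSet>.<Feature>' and '<FeatureSet>.*' into (set, feature) pairs."""
    resolved = []
    for ref in references:
        set_name, _, feature = ref.partition(".")
        if set_name not in catalog:
            raise KeyError(f"unknown feature set: {set_name}")
        if feature == "*":
            # The wildcard pulls in every feature of the set, in catalog order
            resolved.extend((set_name, name) for name in catalog[set_name])
        elif feature in catalog[set_name]:
            resolved.append((set_name, feature))
        else:
            raise KeyError(f"unknown feature: {ref}")
    return resolved


expanded = expand_features(["events.*", "transactions.amount_max_2h"], CATALOG)
for set_name, feature in expanded:
    print(f"{set_name}.{feature}")
```
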

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.  
For example, in this instance the `events` feature set comes first in the list (`events.*`), so it acts as the spine: each event produces one row in the resulting feature vector.
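
The spine behaviour can be pictured as a left join. The sketch below is a simplification under stated assumptions: the rows, the `source` key, and the `join_to_spine` helper are made up for illustration, and timestamps are ignored, whereas the feature store performs the actual join for you.

```python
# Illustrative rows only; the key name and values are made up for this sketch.
spine_rows = [  # rows from the spine feature set
    {"source": "C1", "event_login": 1},
    {"source": "C2", "event_login": 0},
    {"source": "C3", "event_login": 1},
]
other_rows = [  # rows from a second feature set, keyed by the same entity
    {"source": "C1", "amount_max_2h": 18.72},
    {"source": "C3", "amount_max_2h": 25.92},
]


def join_to_spine(spine, other, key):
    """Left-join `other` onto `spine`: exactly one output row per spine row."""
    lookup = {row[key]: row for row in other}
    other_columns = sorted({col for row in other for col in row if col != key})
    joined = []
    for row in spine:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in other_columns:
            merged[col] = match.get(col)  # None when the spine row has no match
        joined.append(merged)
    return joined


vector_rows = join_to_spine(spine_rows, other_rows, key="source")
for row in vector_rows:
    print(row)
```

The output has as many rows as the spine, and spine rows without a match keep an empty value for the joined columns.
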

In [3]:
# Define the list of features to use
features = ['events.*',
            'transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_sum_14d', 
            'transactions.es_health_sum_14d',
            'transactions.es_otherservices_sum_14d', 
            'transactions.es_food_sum_14d',
            'transactions.es_hotelservices_sum_14d', 
            'transactions.es_barsandrestaurants_sum_14d',
            'transactions.es_tech_sum_14d', 
            'transactions.es_sportsandtoys_sum_14d',
            'transactions.es_wellnessandbeauty_sum_14d', 
            'transactions.es_hyper_sum_14d',
            'transactions.es_fashion_sum_14d', 
            'transactions.es_home_sum_14d', 
            'transactions.es_travel_sum_14d', 
            'transactions.es_leisure_sum_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week']

In [4]:
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the feature store
transactions_fv.save()

## Step 2 - Preview the feature vector data

Obtain the values of the features in the feature vector, to ensure the data appears as expected.

In [5]:
# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())

> 2023-06-20 13:44:53,467 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-admin/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-06-20T13:44:53.467002+00:00', 'size': 151474, 'partitioned': True}


In [6]:
# Preview your dataset
train_dataset.to_dataframe()

Unnamed: 0,event_password_change,event_details_change,event_login,amount_max_2h,amount_sum_2h,amount_count_2h,amount_avg_2h,amount_max_12h,amount_sum_12h,amount_count_12h,...,es_home_sum_14d,es_travel_sum_14d,es_leisure_sum_14d,gender_F,gender_M,step,amount,timestamp_hour,timestamp_day_of_week,label
0,0,0,1,1.83,1.83,1.0,1.830000,1.83,1.83,1.0,...,0.0,0.0,0.0,0.0,1.0,72.0,1.83,13.0,6.0,0.0
1,0,0,1,18.72,40.22,3.0,13.406667,18.72,40.22,3.0,...,0.0,0.0,0.0,0.0,1.0,66.0,18.72,13.0,6.0,0.0
2,1,0,0,25.92,64.86,3.0,21.620000,25.92,64.86,3.0,...,0.0,0.0,0.0,0.0,1.0,27.0,25.92,13.0,6.0,0.0
3,1,0,0,24.75,30.17,2.0,15.085000,24.75,30.17,2.0,...,0.0,0.0,0.0,0.0,1.0,141.0,24.75,13.0,6.0,0.0
4,1,0,0,64.18,65.17,2.0,32.585000,64.18,65.17,2.0,...,0.0,0.0,0.0,1.0,0.0,124.0,64.18,13.0,6.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1763,1,0,0,45.28,144.56,5.0,28.912000,161.75,1017.80,33.0,...,0.0,1.0,0.0,1.0,0.0,96.0,24.02,13.0,1.0,0.0
1764,0,1,0,26.81,47.75,2.0,23.875000,68.16,652.19,23.0,...,0.0,0.0,0.0,0.0,1.0,134.0,26.81,13.0,1.0,0.0
1765,0,0,1,33.10,91.11,4.0,22.777500,121.96,1001.32,32.0,...,2.0,0.0,0.0,1.0,0.0,141.0,14.95,12.0,1.0,0.0
1766,1,0,0,71.63,182.18,7.0,26.025714,71.63,1288.32,44.0,...,0.0,0.0,0.0,0.0,1.0,101.0,13.62,12.0,1.0,0.0


## Step 3 - Train models and choose the highest accuracy

With MLRun, you can train different models and compare the results. In the code below, you train three models, each using a different algorithm (random forest, gradient boosting, and AdaBoost), and choose the one with the highest accuracy.
Note that the model named `transaction_fraud_xgboost` uses scikit-learn's `GradientBoostingClassifier`, not the XGBoost library.
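
The training cell in this step passes `strategy='list'` and `selector='max.accuracy'` to `with_hyper_params`. The sketch below shows one way to read those two settings, assuming the list strategy pairs the i-th value of every parameter and that iterations are numbered from 1. The helpers and the accuracy values are hypothetical and do not come from MLRun.

```python
def list_strategy(hyper_params):
    """Pair the i-th value of every parameter: one combination per index."""
    lengths = {len(values) for values in hyper_params.values()}
    if len(lengths) != 1:
        raise ValueError("every parameter list must have the same length")
    names = list(hyper_params)
    return [dict(zip(names, combo)) for combo in zip(*hyper_params.values())]


def select_best(results, selector):
    """Apply a '<max|min>.<metric>' selector; return the 1-based winning iteration."""
    mode, _, metric = selector.partition(".")
    pick = max if mode == "max" else min
    best_index = pick(range(len(results)), key=lambda i: results[i][metric])
    return best_index + 1


params = {
    "model_name": ["rf", "gb", "adaboost"],
    "model_class": [
        "RandomForestClassifier",
        "GradientBoostingClassifier",
        "AdaBoostClassifier",
    ],
}
iterations = list_strategy(params)

# Made-up accuracies, for illustration only.
fake_results = [{"accuracy": 0.91}, {"accuracy": 0.95}, {"accuracy": 0.93}]
best_iteration = select_best(fake_results, "max.accuracy")
print(iterations[best_iteration - 1])
```

With three names and three classes, this pairing yields three runs rather than the nine a full grid would produce.
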

In [7]:
# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function('hub://auto_trainer')

In [9]:
# Prepare the hyperparameter lists for the training function:
# three model names, each paired with the model class at the same index
training_params = {"model_name": ['transaction_fraud_rf', 
                                  'transaction_fraud_xgboost', 
                                  'transaction_fraud_adaboost'],

                   "model_class": ['sklearn.ensemble.RandomForestClassifier',
                                   'sklearn.ensemble.GradientBoostingClassifier',
                                   'sklearn.ensemble.AdaBoostClassifier']}

# Define the training task, including your feature vector, label and hyperparams definitions
train_task = mlrun.new_task('training', 
                      inputs={'dataset': transactions_fv.uri},
                      params={'label_columns': 'label'}
                     )

train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

# Specify your cluster image
classifier_fn.spec.image = 'mlrun/mlrun'

# Run training
classifier_fn.run(train_task, local=False)

> 2023-06-20 13:48:48,291 [info] Storing function: {'name': 'training', 'uid': '29893c70cb6b433fa3b0a0ffc34082aa', 'db': 'http://mlrun-api:8080'}
> 2023-06-20 13:48:48,567 [info] Job is running in the background, pod: training-2vp4s
> 2023-06-20 13:48:54,040 [info] Storing function: {'name': 'training', 'uid': '29893c70cb6b433fa3b0a0ffc34082aa', 'db': None}
> 2023-06-20 13:48:55,930 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-06-20 13:48:57,399 [info] label columns: label
> 2023-06-20 13:48:57,399 [info] Sample set not given, using the whole training set as the sample set
> 2023-06-20 13:48:57,640 [info] training 'transaction_fraud_rf'

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `Calibra

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-admin,...c34082aa,0,Jun 20 13:48:54,completed,training,v3io_user=admin; kind=; owner=admin; mlrun/client_version=1.4.0-rc7; mlrun/client_python_version=3.9.16,dataset,label_columns=label,best_iteration=1; accuracy=0.9971751412429378; f1_score=0.8; precision_score=1.0; recall_score=0.6666666666666666,feature-importance; test_set; confusion-matrix; roc-curves; calibration-curve; model; iteration_results; parallel_coordinates





> 2023-06-20 13:49:14,081 [info] run executed, status=completed: {'name': 'training'}


<mlrun.model.RunObject at 0x7ff6fab95a60>

## Step 4 - Perform feature selection

As part of the data science process, try to reduce the number of features in the training dataset: dropping uninformative features saves computation time.

Use the ready-made feature selection function from MLRun's [`hub://feature_selection`](https://github.com/mlrun/functions/blob/development/feature_selection/feature_selection.ipynb) to select the best features to keep, based on a sample from your dataset.
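
The call in this step sets `k` and `min_votes`. One plausible reading, offered here as an assumption rather than a description of the hub function's implementation, is that each scoring method votes for its `k` top-scoring features and a feature is kept when it collects at least `min_votes` votes. The method names, scores, and the `select_by_votes` helper below are made up for illustration.

```python
from collections import Counter


def select_by_votes(scores_by_method, k, min_votes):
    """Each method votes for its k top-scoring features; keep those with enough votes."""
    votes = Counter()
    for scores in scores_by_method.values():
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        votes.update(top_k)
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Made-up scores from three hypothetical scoring methods.
scores_by_method = {
    "method_a": {"amount": 0.9, "step": 0.1, "gender_F": 0.3, "amount_max_2h": 0.8},
    "method_b": {"amount": 0.7, "step": 0.6, "gender_F": 0.2, "amount_max_2h": 0.5},
    "method_c": {"amount": 0.4, "step": 0.2, "gender_F": 0.9, "amount_max_2h": 0.8},
}

selected = select_by_votes(scores_by_method, k=2, min_votes=2)
print(selected)
```

Raising `min_votes` makes the selection stricter, because more methods must agree on a feature before it is kept.
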


In [10]:
feature_selection_fn = mlrun.import_function('hub://feature_selection')

feature_selection_run = feature_selection_fn.run(
            params={"k": 18,
                    "min_votes": 2,
                    "label_column": 'label',
                    'output_vector_name':fv_name + "-short",
                    'ignore_type_errors': True},
    
            inputs={'df_artifact': transactions_fv.uri},
            name='feature_extraction',
            handler='feature_selection',
    local=False)

> 2023-06-20 13:49:57,932 [info] Storing function: {'name': 'feature-extraction', 'uid': 'df259a74d5254b409c40d8f864846bce', 'db': 'http://mlrun-api:8080'}


Names with underscore '_' are about to be deprecated, use dashes '-' instead. Replacing underscores with dashes.


> 2023-06-20 13:49:58,216 [info] Job is running in the background, pod: feature-extraction-xhnpp
> 2023-06-20 13:52:55,719 [info] Storing function: {'name': 'feature-extraction', 'uid': 'df259a74d5254b409c40d8f864846bce', 'db': None}
Liblinear failed to converge, increase the number of iterations.
> 2023-06-20 13:53:04,663 [info] votes needed to be selected: 2
> 2023-06-20 13:53:06,179 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-admin/FeatureStore/transactions-fraud-short/parquet/vectors/transactions-fraud-short-latest.parquet', 'status': 'ready', 'updated': '2023-06-20T13:53:06.179691+00:00', 'size': 628111, 'partitioned': True}
> 2023-06-20 13:53:06,895 [info] run executed, status=completed: {'name': 'feature-extraction'}
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-admin,...64846bce,0,Jun 20 13:52:55,completed,feature-extraction,v3io_user=admin; kind=; owner=admin; mlrun/client_version=1.4.0-rc7; mlrun/client_python_version=3.9.16; host=feature-extraction-xhnpp,df_artifact,k=18; min_votes=2; label_column=label; output_vector_name=transactions-fraud-short; ignore_type_errors=True,top_features_vector=store://feature-vectors/fraud-demo-admin/transactions-fraud-short,f_classif; mutual_info_classif; chi2; f_regression; LinearSVC; LogisticRegression; ExtraTreesClassifier; feature_scores; max_scaled_scores_feature_scores; selected_features_count; selected_features





> 2023-06-20 13:53:08,521 [info] run executed, status=completed: {'name': 'feature-extraction'}


In [11]:
mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)

Unnamed: 0,amount_max_2h,amount_sum_2h,amount_count_2h,amount_avg_2h,amount_max_12h,amount_sum_12h,amount_count_12h,amount_avg_12h,amount_max_24h,amount_sum_24h,amount_count_24h,amount_avg_24h,es_transportation_sum_14d,es_health_sum_14d,es_otherservices_sum_14d,label
9996,31.14,31.14,1.0,31.14,31.14,31.14,1.0,31.14,119.5,330.61,5.0,66.122,0.0,7.0,0.0,0
9997,218.48,365.3,5.0,73.06,218.48,1029.85,24.0,42.910417,218.48,1927.7,58.0,33.236207,107.0,5.0,1.0,0
9998,34.93,118.22,5.0,23.644,79.16,935.26,31.0,30.169677,89.85,2062.69,68.0,30.333676,116.0,0.0,0.0,0
9999,77.76,189.08,3.0,63.026667,77.76,1099.98,35.0,31.428,95.71,2451.98,72.0,34.055278,122.0,0.0,0.0,0
10000,68.32,149.53,4.0,37.3825,81.0,835.89,30.0,27.863,81.0,1995.38,69.0,28.918551,119.0,0.0,0.0,0


## Step 5 - Train your models with top features

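Following feature selection, you train new models using the selected features. In the run below, accuracy remains high 
(about 0.988, compared with 0.997 for the full feature vector), so the model needs fewer features to reach similar accuracy. 
Note, however, that the precision, recall, and F1 scores of the selected iteration are 0.0 in this output, so review those metrics as well as accuracy before relying on the smaller model.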
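
The run output for this step reports an accuracy of about 0.988 alongside precision, recall, and F1 scores of 0.0. One way such a combination can arise on a dataset with few positive labels is a model that never predicts the positive class. The sketch below reproduces that effect with made-up labels; the counts are assumptions chosen for illustration and are not taken from the demo's test set.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 12 fraudulent rows out of 1000.
y_true = [1] * 12 + [0] * 988
# A degenerate model that always predicts "not fraud".
y_pred = [0] * 1000

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because accuracy is dominated by the majority class here, selecting on a metric such as F1 or recall may give a more useful comparison for fraud detection.
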

In [12]:
# Define your training task, including your feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task('training', 
                      inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
                      params={'label_columns': 'label'}
                     )
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

classifier_fn.run(ensemble_train_task)

> 2023-06-20 13:53:08,694 [info] Storing function: {'name': 'training', 'uid': '90f25e3bc159480480cb40b4b5e4b3fe', 'db': 'http://mlrun-api:8080'}
> 2023-06-20 13:53:09,053 [info] Job is running in the background, pod: training-lwjrw
> 2023-06-20 13:53:15,187 [info] Storing function: {'name': 'training', 'uid': '90f25e3bc159480480cb40b4b5e4b3fe', 'db': None}
> 2023-06-20 13:53:17,467 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-06-20 13:53:18,089 [info] label columns: label
> 2023-06-20 13:53:18,089 [info] Sample set not given, using the whole training set as the sample set
> 2023-06-20 13:53:18,380 [info] training 'transaction_fraud_rf'

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `Calibra

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-admin,...b5e4b3fe,0,Jun 20 13:53:15,completed,training,v3io_user=admin; kind=; owner=admin; mlrun/client_version=1.4.0-rc7; mlrun/client_python_version=3.9.16,dataset,label_columns=label,best_iteration=1; accuracy=0.9880059970014993; f1_score=0.0; precision_score=0.0; recall_score=0.0,feature-importance; test_set; confusion-matrix; roc-curves; calibration-curve; model; iteration_results; parallel_coordinates





> 2023-06-20 13:53:34,567 [info] run executed, status=completed: {'name': 'training'}


<mlrun.model.RunObject at 0x7ff676d53a90>

## Done!

You've completed Part 2 of the model training with the feature store.
Proceed to [Part 3](03-deploy-serving-model.html) to learn how to deploy and monitor the model.