# Part 2: Training

In this part, we show how MLRun's **Feature Store** makes it easy to define a **Feature Vector** and create the dataset needed for the training process.  
By the end of this tutorial you’ll learn how to:
- Combine multiple data sources into a single Feature Vector
- Create a training dataset
- Create a model using an MLRun Hub function

In [1]:
project_name = 'fraud-demo'

In [2]:
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)

> 2023-02-09 15:35:11,837 [info] loaded project fraud-demo from MLRun DB


## Step 1 - Create a Feature Vector  
In this section we create our Feature Vector.  
The Feature Vector has a `name`, so we can reference it later via its URI or from our serving function, and a list of `features` from the available FeatureSets. To add a single feature from a FeatureSet, add `<FeatureSet>.<Feature>` to the list; to add all of a FeatureSet's available features, add `<FeatureSet>.*`.  

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.  
For example, in this instance `events` is the first FeatureSet in the list and therefore the spine, so each event produces a row in the resulting Feature Vector.
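
To make the `<FeatureSet>.<Feature>` convention concrete, here is a minimal pure-Python sketch that groups feature references by FeatureSet and reads off the default spine as the first FeatureSet encountered. The helper `group_features` is hypothetical and is not part of MLRun; it only illustrates the naming scheme described above, using a shortened feature list.

```python
def group_features(features):
    """Group '<FeatureSet>.<Feature>' references by FeatureSet, keeping list order."""
    grouped = {}
    for ref in features:
        # Split on the first dot only: the left part is the FeatureSet name.
        feature_set, _, feature = ref.partition(".")
        if not feature_set or not feature:
            raise ValueError(f"expected '<FeatureSet>.<Feature>', got {ref!r}")
        grouped.setdefault(feature_set, []).append(feature)
    return grouped


# A shortened version of the feature list used in this tutorial
example_features = [
    "events.*",
    "transactions.amount_max_2h",
    "transactions.amount_sum_2h",
]

grouped = group_features(example_features)

# Dicts preserve insertion order, so the first key is the first FeatureSet listed
spine = next(iter(grouped))

print(spine)    # events
print(grouped)  # {'events': ['*'], 'transactions': ['amount_max_2h', 'amount_sum_2h']}
```

Reordering the list so that a `transactions` feature comes first would make `transactions` the first FeatureSet, and therefore the default spine.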

In [3]:
# Define the list of features we will be using
features = ['events.*',
            'transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_sum_14d', 
            'transactions.es_health_sum_14d',
            'transactions.es_otherservices_sum_14d', 
            'transactions.es_food_sum_14d',
            'transactions.es_hotelservices_sum_14d', 
            'transactions.es_barsandrestaurants_sum_14d',
            'transactions.es_tech_sum_14d', 
            'transactions.es_sportsandtoys_sum_14d',
            'transactions.es_wellnessandbeauty_sum_14d', 
            'transactions.es_hyper_sum_14d',
            'transactions.es_fashion_sum_14d', 
            'transactions.es_home_sum_14d', 
            'transactions.es_travel_sum_14d', 
            'transactions.es_leisure_sum_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week']

In [4]:
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using our Feature Store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the Feature Store
transactions_fv.save()

## Step 2 - Preview the Feature Vector Data

Obtain the values of the features in the feature vector to verify that the data appears as expected.

In [5]:
# Import the Parquet Target so we can directly save our dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())

> 2023-02-09 15:35:13,260 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-dani/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-02-09T15:35:13.260443+00:00', 'size': 140888, 'partitioned': True}


In [6]:
# Preview our dataset
train_dataset.to_dataframe().tail(5)

Unnamed: 0,event_details_change,event_login,event_password_change,amount_max_2h,amount_sum_2h,amount_count_2h,amount_avg_2h,amount_max_12h,amount_sum_12h,amount_count_12h,...,es_home_sum_14d,es_travel_sum_14d,es_leisure_sum_14d,gender_F,gender_M,step,amount,timestamp_hour,timestamp_day_of_week,label
1763,0,0,1,45.28,144.56,5.0,28.912,161.75,1017.8,33.0,...,0.0,1.0,0.0,1.0,0.0,96.0,24.02,15.0,3.0,0.0
1764,1,0,0,26.81,47.75,2.0,23.875,68.16,652.19,23.0,...,0.0,0.0,0.0,0.0,1.0,134.0,26.81,15.0,3.0,0.0
1765,0,1,0,33.1,91.11,4.0,22.7775,121.96,1001.32,32.0,...,2.0,0.0,0.0,1.0,0.0,141.0,14.95,14.0,3.0,0.0
1766,0,0,1,71.63,182.18,7.0,26.025714,71.63,1256.23,43.0,...,0.0,0.0,0.0,0.0,1.0,101.0,13.62,14.0,3.0,0.0
1767,0,0,1,44.37,76.87,4.0,19.2175,159.32,1076.14,36.0,...,0.0,0.0,0.0,0.0,1.0,40.0,12.82,15.0,3.0,0.0


## Step 3 - Train Models and Choose Highest Accuracy

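With MLRun, you can easily train different models and compare the results. In the code below, we train three models,
each using a different algorithm (random forest, gradient boosting, and AdaBoost), and choose the model with the highest accuracy. Note that the model named `transaction_fraud_xgboost` is trained with scikit-learn's `GradientBoostingClassifier`, not with the XGBoost library.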
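
The training cell in this step passes `strategy='list'` and `selector='max.accuracy'` to `with_hyper_params`. With the list strategy, the parameter lists are paired by position (the first `model_name` with the first `model_class`, and so on), and the selector picks the iteration with the highest `accuracy` result. The following is a minimal pure-Python sketch of that behavior, not MLRun's implementation: `list_strategy`, `select_best`, the 1-based iteration numbering, and the accuracy values are all assumptions made for illustration.

```python
def list_strategy(hyper_params):
    """Pair the i-th value of every parameter list into the i-th iteration."""
    lengths = {len(values) for values in hyper_params.values()}
    if len(lengths) != 1:
        raise ValueError("all parameter lists must have the same length")
    names = list(hyper_params)
    return [dict(zip(names, combo)) for combo in zip(*hyper_params.values())]


def select_best(results, selector):
    """Return the 1-based iteration number chosen by a 'max.<metric>' or 'min.<metric>' selector."""
    mode, _, metric = selector.partition(".")
    if mode not in ("max", "min") or not metric:
        raise ValueError(f"unsupported selector: {selector!r}")
    pick = max if mode == "max" else min
    best_index = pick(range(len(results)), key=lambda i: results[i][metric])
    return best_index + 1


training_params = {
    "model_name": ["transaction_fraud_rf",
                   "transaction_fraud_xgboost",
                   "transaction_fraud_adaboost"],
    "model_class": ["sklearn.ensemble.RandomForestClassifier",
                    "sklearn.ensemble.GradientBoostingClassifier",
                    "sklearn.ensemble.AdaBoostClassifier"],
}

iterations = list_strategy(training_params)
for number, params in enumerate(iterations, start=1):
    print(number, params["model_name"], params["model_class"])

# Made-up accuracy values, one per iteration (NOT taken from the actual run)
hypothetical_results = [{"accuracy": 0.90}, {"accuracy": 0.95}, {"accuracy": 0.93}]
best = select_best(hypothetical_results, "max.accuracy")
print(best)  # 2
```

Because the lists are paired rather than combined, three names and three classes yield three iterations, not nine.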

In [7]:
# Import the auto_trainer function (used here with scikit-learn classifiers) from the functions hub
classifier_fn = mlrun.import_function('hub://auto_trainer')

In [8]:
# Prepare the parameters list for the training function
# We will be using 3 different models
training_params = {"model_name": ['transaction_fraud_rf', 
                                  'transaction_fraud_xgboost', 
                                  'transaction_fraud_adaboost'],
              
                  "model_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}

# Define the training task, including our feature vector, label and hyperparams definitions
train_task = mlrun.new_task('training', 
                      inputs={'dataset': transactions_fv.uri},
                      params={'label_columns': 'label'}
                     )

train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

# Specify our cluster image
classifier_fn.spec.image = 'mlrun/mlrun'

# Run training
classifier_fn.run(train_task, local=False)

> 2023-02-09 15:35:13,756 [info] starting run training uid=89c1af33e93d44bead0a5f63e7c055f0 DB=http://mlrun-api:8080
> 2023-02-09 15:35:13,925 [info] Job is running in the background, pod: training-slwvk
> 2023-02-09 15:35:18,968 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-09 15:35:20,180 [info] label columns: label
> 2023-02-09 15:35:20,180 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-09 15:35:20,450 [info] training 'transaction_fraud_rf'

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-09 15:35:23,472 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-09 1

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-dani,...e7c055f0,0,Feb 09 15:35:18,completed,training,v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc21 mlrun/client_python_version=3.9.13,dataset,label_columns=label,best_iteration=3 accuracy=0.9971751412429378 f1_score=0.9090909090909091 precision_score=1.0 recall_score=0.8333333333333334,feature-importance test_set confusion-matrix roc-curves calibration-curve model iteration_results parallel_coordinates





> 2023-02-09 15:35:35,492 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7f5d537d1eb0>

## Step 4 - Perform Feature Selection

As part of our data science process, we try to reduce the training dataset's size by removing uninformative features, which also saves computation time.

We use the ready-made feature selection function from the hub, [`hub://feature_selection`](https://github.com/mlrun/functions/blob/development/feature_selection/feature_selection.ipynb), to select the best features to keep, running it on a sample from our dataset.
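
The cell in this step passes `k` and `min_votes` to the feature selection function. One way to read these parameters is as a voting scheme: each scoring method nominates its `k` top-scoring features, and a feature is kept only if at least `min_votes` methods nominated it. The sketch below illustrates that idea in plain Python under this assumption; it is not the hub function's code, the helpers `top_k` and `select_by_votes` are hypothetical, and the scores are made up.

```python
from collections import Counter


def top_k(scores, k):
    """Return the names of the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def select_by_votes(scores_by_method, k, min_votes):
    """Keep features that appear in the top-k list of at least min_votes methods."""
    votes = Counter()
    for scores in scores_by_method.values():
        votes.update(top_k(scores, k))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Made-up scores for three scoring methods over four features
scores_by_method = {
    "f_classif": {"amount_max_2h": 0.9, "amount_sum_2h": 0.7, "gender_F": 0.2, "step": 0.1},
    "chi2":      {"amount_max_2h": 0.8, "step": 0.6, "amount_sum_2h": 0.3, "gender_F": 0.1},
    "LinearSVC": {"amount_sum_2h": 0.5, "gender_F": 0.4, "amount_max_2h": 0.3, "step": 0.2},
}

selected = select_by_votes(scores_by_method, k=2, min_votes=2)
print(selected)  # ['amount_max_2h', 'amount_sum_2h']
```

In this toy example `step` and `gender_F` each receive a single vote and are dropped. Raising `min_votes` makes the selection stricter, while raising `k` lets each method nominate more candidates.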


In [9]:
feature_selection_fn = mlrun.import_function('hub://feature_selection')

# Run feature selection on the full feature vector and save the result as a new, shorter vector
feature_selection_run = feature_selection_fn.run(
    params={"k": 18,
            "min_votes": 2,
            "label_column": 'label',
            'output_vector_name': fv_name + "-short",
            'ignore_type_errors': True},
    inputs={'df_artifact': transactions_fv.uri},
    name='feature_extraction',
    handler='feature_selection',
    local=False)

> 2023-02-09 15:35:35,875 [info] starting run feature_extraction uid=c3746e3ae9b54a7db500b6b51cc59482 DB=http://mlrun-api:8080
> 2023-02-09 15:35:36,036 [info] Job is running in the background, pod: feature-extraction-xmts7
Names with underscore '_' are about to be deprecated, use dashes '-' instead.Replacing underscores with dashes.
Liblinear failed to converge, increase the number of iterations.
Liblinear failed to converge, increase the number of iterations.
> 2023-02-09 15:35:48,035 [info] votes needed to be selected: 2
> 2023-02-09 15:35:49,162 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-dani/FeatureStore/transactions-fraud-short/parquet/vectors/transactions-fraud-short-latest.parquet', 'status': 'ready', 'updated': '2023-02-09T15:35:49.162900+00:00', 'size': 569726, 'partitioned': True}
> 2023-02-09 15:35:50,026 [info] To track results use the CLI: {'info_cmd': 'mlrun get run c3746e3ae9b54a7db500b6b51cc59482 -p fraud-demo-dani'

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-dani,...1cc59482,0,Feb 09 15:35:40,completed,feature_extraction,v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc21 mlrun/client_python_version=3.9.13 host=feature-extraction-xmts7,df_artifact,k=18 min_votes=2 label_column=label output_vector_name=transactions-fraud-short ignore_type_errors=True,top_features_vector=store://feature-vectors/fraud-demo-dani/transactions-fraud-short,f_classif mutual_info_classif chi2 f_regression LinearSVC LogisticRegression ExtraTreesClassifier feature_scores max_scaled_scores_feature_scores selected_features_count selected_features





> 2023-02-09 15:35:52,817 [info] run executed, status=completed


In [10]:
mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)

Unnamed: 0,amount_max_2h,amount_sum_2h,amount_count_2h,amount_avg_2h,amount_max_12h,amount_sum_12h,amount_count_12h,amount_avg_12h,amount_max_24h,amount_sum_24h,amount_count_24h,amount_avg_24h,es_transportation_sum_14d,es_health_sum_14d,es_otherservices_sum_14d,label
9995,54.55,82.91,3.0,27.636667,70.47,719.55,25.0,28.782,85.97,1710.85,57.0,30.014912,120.0,0.0,0.0,0
9996,31.14,31.14,1.0,31.14,31.14,31.14,1.0,31.14,119.5,330.61,5.0,66.122,0.0,7.0,0.0,0
9997,218.48,346.72,4.0,86.68,218.48,1005.1,23.0,43.7,218.48,1927.7,58.0,33.236207,107.0,5.0,1.0,0
9998,34.93,118.22,5.0,23.644,79.16,935.26,31.0,30.169677,89.85,2062.69,68.0,30.333676,116.0,0.0,0.0,0
9999,77.76,189.08,3.0,63.026667,77.76,1099.98,35.0,31.428,95.71,2451.98,72.0,34.055278,122.0,0.0,0.0,0


## Step 5 - Train our models with top features

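Following the feature selection, we train new models using the selected features. A model that needs fewer features is cheaper to compute and has fewer inputs that can go wrong.
In the results of this run, accuracy remains high (0.989), but the F1, precision, and recall scores are much lower than in Step 3, so review all of the reported metrics, not accuracy alone, before relying on the reduced feature set.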
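
The results of this step's training run report an accuracy of 0.989 together with an F1 score of about 0.15. The sketch below shows how the four metrics are computed from confusion-matrix counts, and why accuracy alone can look strong on imbalanced data such as fraud labels. The counts are an assumption: they are a set of counts consistent with the reported metric values for a 2,000-row test set (20% of 10,000 rows), inferred by arithmetic and not read from the run's artifacts. The function `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no positive predictions or labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed counts: 2 frauds caught, 5 false alarms, 17 frauds missed, 1976 correct negatives
metrics = classification_metrics(tp=2, fp=5, fn=17, tn=1976)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, the large number of true negatives keeps accuracy at 0.989 even though only 2 of the 19 positive cases are detected, which is why recall and F1 are low.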

In [11]:
# Prepare the parameters list for the training function
# We will be using 3 different models
training_params = {"model_name": ['transaction_fraud_rf', 
                                  'transaction_fraud_xgboost', 
                                  'transaction_fraud_adaboost'],
              
                  "model_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}

# Defining our training task, including our feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task('training', 
                      inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
                      params={'label_columns': 'label'}
                     )
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

classifier_fn.run(ensemble_train_task)

> 2023-02-09 15:35:52,933 [info] starting run training uid=d22efb04a2fa4f5c8ddb7fbef0d7500d DB=http://mlrun-api:8080
> 2023-02-09 15:35:53,092 [info] Job is running in the background, pod: training-cskvh
> 2023-02-09 15:35:58,230 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-09 15:35:58,984 [info] label columns: label
> 2023-02-09 15:35:58,984 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-09 15:35:59,250 [info] training 'transaction_fraud_top_rf'

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-09 15:36:02,607 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-

project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-dani,...f0d7500d,0,Feb 09 15:35:57,completed,training,v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc21 mlrun/client_python_version=3.9.13,dataset,label_columns=label tag=latest,best_iteration=3 accuracy=0.989 f1_score=0.15384615384615385 precision_score=0.2857142857142857 recall_score=0.10526315789473684,feature-importance test_set confusion-matrix roc-curves calibration-curve model iteration_results parallel_coordinates





> 2023-02-09 15:36:14,622 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7f5d52925b80>

## Done!

You've completed Part 2 of the model training with the feature store.
Proceed to [Part 3](03-deploy-serving-model.ipynb) to learn how to deploy and monitor the model.