# Part 2: Training  
In this part we show how MLRun's **Feature Store** lets us define a **Feature Vector** and create the dataset we need for our training process.  
By the end of this tutorial you'll learn how to:
- Combine multiple data sources into a single Feature Vector
- Create a training dataset
- Create a model using an MLRun Hub function

In [1]:
import mlrun

project_name, _ = mlrun.set_environment(project='fraud-demo', 
                                        user_project=True)

## Create Feature Vector  
In this section we create our Feature Vector.  
The Feature Vector has a `name`, so we can reference it later via its URI or from our serving function, and a list of `features` from the available FeatureSets. We can add a single feature by adding `<FeatureSet>.<Feature>` to the list, or add `<FeatureSet>.*` to include all of that FeatureSet's available features.  
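
To make the reference syntax concrete, here is a minimal sketch of how `<FeatureSet>.<Feature>` and `<FeatureSet>.*` entries could be expanded into individual features. It is plain Python, not MLRun's implementation: the `catalog` dictionary and the `expand_feature_refs` helper are hypothetical names introduced for illustration, and the `events` feature names are made up.

```python
def expand_feature_refs(refs, catalog):
    """Expand '<set>.<feature>' and '<set>.*' references into (set, feature) pairs.

    `catalog` maps a feature set name to the list of features it offers.
    Both the helper and the catalog are illustrative, not part of MLRun.
    """
    expanded = []
    for ref in refs:
        set_name, _, feature = ref.partition(".")
        if set_name not in catalog:
            raise KeyError(f"unknown feature set: {set_name}")
        if feature == "*":
            # Wildcard: take every feature the set offers, in catalog order
            expanded.extend((set_name, name) for name in catalog[set_name])
        elif feature in catalog[set_name]:
            expanded.append((set_name, feature))
        else:
            raise KeyError(f"unknown feature: {ref}")
    return expanded


# Hypothetical catalog; the `events` feature names are invented for this sketch
catalog = {
    "transactions": ["amount_max_2h", "amount_sum_2h", "amount"],
    "events": ["event_login", "event_password_change"],
}

pairs = expand_feature_refs(["transactions.amount_max_2h", "events.*"], catalog)
for set_name, feature in pairs:
    print(f"{set_name}.{feature}")
```

The wildcard entry contributes one pair per feature in the set, while an explicit entry contributes exactly one pair.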

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.  
For example, in this instance the first feature comes from the `transactions` FeatureSet, so `transactions` is our spine and each transaction produces a row in the resulting Feature Vector.
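
The spine behaviour resembles a left join: every spine row is kept, and matching values from the other feature sets are attached by entity key. Below is a minimal pure-Python sketch of that idea. The `source` entity key and all rows are assumptions made up for illustration, and `join_to_spine` is a hypothetical helper; MLRun's actual join also takes timestamps into account, which this sketch ignores.

```python
def join_to_spine(spine_rows, other_rows, key):
    """Left-join `other_rows` onto `spine_rows` by `key`.

    Every spine row yields exactly one output row; spine rows without a
    match simply carry no extra columns.
    """
    # Index the non-spine rows by entity key (last row wins on duplicates)
    lookup = {row[key]: row for row in other_rows}
    joined = []
    for row in spine_rows:
        merged = dict(row)
        match = lookup.get(row[key], {})
        merged.update({k: v for k, v in match.items() if k != key})
        joined.append(merged)
    return joined


# Made-up rows; `source` as the entity key is an assumption of this sketch
spine = [
    {"source": "C1", "amount": 27.3},
    {"source": "C2", "amount": 7.89},
    {"source": "C3", "amount": 34.04},
]
others = [
    {"source": "C1", "event_login": 1},
    {"source": "C3", "event_login": 0},
]

result = join_to_spine(spine, others, key="source")
print(len(result))  # one output row per spine row
```

Because the spine drives the output, a row in `others` whose key never appears in the spine would not show up in the result at all.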

In [2]:
# Define the list of features we will be using
features = ['transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_count_14d', 
            'transactions.es_health_count_14d',
            'transactions.es_otherservices_count_14d', 
            'transactions.es_food_count_14d',
            'transactions.es_hotelservices_count_14d', 
            'transactions.es_barsandrestaurants_count_14d',
            'transactions.es_tech_count_14d', 
            'transactions.es_sportsandtoys_count_14d',
            'transactions.es_wellnessandbeauty_count_14d', 
            'transactions.es_hyper_count_14d',
            'transactions.es_fashion_count_14d', 
            'transactions.es_home_count_14d', 
            'transactions.es_travel_count_14d', 
            'transactions.es_leisure_count_14d',
            'transactions.age_mapped',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week',
            'events.*']

In [3]:
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
feature_vector_name = 'transactions-fraud'

# Define the feature vector using our Feature Store (fstore)
fv = fstore.FeatureVector(feature_vector_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the Feature Store
fv.save()

## Produce training dataset 

In [4]:
# Import the Parquet Target so we can directly save our dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get the offline features for our feature vector.
# This returns a response object (converted to a pandas DataFrame in the next cell)
# and saves the dataset to parquet so a training job can train on it
dataset = fstore.get_offline_features(feature_vector_name, target=ParquetTarget())

> 2021-08-11 06:21:17,437 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-gilads/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2021-08-11T06:21:17.437742+00:00', 'size': 1963378}


In [5]:
# Preview our dataset
df = dataset.to_dataframe()
# df['age_mapped'] = df['age_mapped'].astype(int)
# df.tail()

## Run training with AutoML as a cluster job

In [6]:
from mlrun.platforms import auto_mount

# Import the SKLearn based training function from our functions hub
fn = mlrun.import_function('hub://sklearn-classifier').apply(auto_mount())

In [7]:
# Prepare the parameters list for the training function
# We test 3 different models on our dataset
# (note: the 'transaction_fraud_xgboost' entry is trained with scikit-learn's GradientBoostingClassifier)
model_list = {"model_name": ['transaction_fraud_rf', 
                             'transaction_fraud_xgboost', 
                             'transaction_fraud_adaboost'],
              
              "model_pkg_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}

# Define the training task, including our feature vector, label and hyperparams definitions
task = mlrun.new_task('training', 
                      inputs={'dataset': fv.uri},
                      params={'label_column': 'label'}
                     )

task.with_hyper_params(model_list, strategy='list', selector='max.accuracy')

# Run the function 
fn.spec.image = 'mlrun/mlrun'
train = fn.run(task, local=False)

> 2021-08-11 06:21:17,743 [info] starting run training uid=68b6abbaffe7415ca7ff389613a1752b DB=http://mlrun-api:8080
> 2021-08-11 06:21:17,937 [info] Job is running in the background, pod: training-trggq
> 2021-08-11 06:21:43,233 [info] best iteration=3, used criteria max.accuracy
> 2021-08-11 06:21:44,022 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-gilads,...13a1752b,0,Aug 11 06:21:22,completed,training,v3io_user=gilads; kind=job; owner=gilads,dataset,label_column=label,best_iteration=3; accuracy=0.9919416730621642; test-error=0.008058326937835763; rocauc=0.9466440737402435; brier_score=0.17948227103123096; f1-score=0.6379310344827586; precision_score=0.8222222222222222; recall_score=0.5211267605633803,test_set; probability-calibration; confusion-matrix; feature-importances; precision-recall-binary; roc-binary; model; iteration_results


to track results use .show() or .logs() or in CLI: 
!mlrun get run 68b6abbaffe7415ca7ff389613a1752b --project fraud-demo-gilads , !mlrun logs 68b6abbaffe7415ca7ff389613a1752b --project fraud-demo-gilads
> 2021-08-11 06:21:47,271 [info] run executed, status=completed
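
For intuition about `strategy='list'` and `selector='max.accuracy'` in the cell above: with the list strategy, iteration *i* takes the *i*-th value from every parameter list, and the selector keeps the iteration with the highest `accuracy` result. The sketch below imitates that behaviour in plain Python. `list_strategy`, `select_best` and the accuracy values are placeholders invented for illustration; they are neither MLRun code nor results from this run.

```python
def list_strategy(hyper_params):
    """Yield one parameter dict per iteration by pairing the i-th values."""
    lengths = {len(values) for values in hyper_params.values()}
    if len(lengths) != 1:
        raise ValueError("all parameter lists must have the same length")
    names = list(hyper_params)
    for values in zip(*hyper_params.values()):
        yield dict(zip(names, values))


def select_best(results, selector):
    """Pick the best iteration for a '<max|min>.<metric>' selector string."""
    mode, _, metric = selector.partition(".")
    pick = max if mode == "max" else min
    return pick(results, key=lambda result: result[metric])


model_list = {
    "model_name": ["transaction_fraud_rf",
                   "transaction_fraud_xgboost",
                   "transaction_fraud_adaboost"],
    "model_pkg_class": ["sklearn.ensemble.RandomForestClassifier",
                        "sklearn.ensemble.GradientBoostingClassifier",
                        "sklearn.ensemble.AdaBoostClassifier"],
}

# Placeholder scores standing in for real training results
placeholder_accuracy = {
    "transaction_fraud_rf": 0.90,
    "transaction_fraud_xgboost": 0.91,
    "transaction_fraud_adaboost": 0.92,
}

results = []
# Child iterations are numbered from 1 here (an assumption of this sketch)
for iteration, params in enumerate(list_strategy(model_list), start=1):
    accuracy = placeholder_accuracy[params["model_name"]]
    results.append({"iter": iteration, **params, "accuracy": accuracy})

best = select_best(results, "max.accuracy")
print(best["iter"], best["model_name"])
```

Note that the list strategy produces three iterations here, not the nine a full grid over both lists would produce, because names and classes are paired by position.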


## Perform feature selection process on a sample of the dataset (Using mlrun marketplace function)

As part of our data science process we will try to reduce the training dataset's size by removing weak or unhelpful features, which also saves computation time.

We use the ready-made feature selection function from our hub, [`hub://feature_selection`](https://github.com/mlrun/functions/blob/development/feature_selection/feature_selection.ipynb), to pick the best features to keep, running it on a sample of our dataset.
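
The run log further down reports `votes needed to be selected: 3`, which suggests a voting scheme: several scoring methods each nominate their top `k` features, and a feature is kept when enough methods vote for it. The following is a minimal sketch of that idea under this assumption, not the hub function's code. The scores are invented, and `top_k` and `vote_select` are hypothetical helpers.

```python
from collections import Counter


def top_k(scores, k):
    """Return the names of the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def vote_select(method_scores, k, votes_needed):
    """Keep features nominated by at least `votes_needed` scoring methods."""
    votes = Counter()
    for scores in method_scores.values():
        votes.update(top_k(scores, k))
    return sorted(name for name, count in votes.items() if count >= votes_needed)


# Invented scores for three hypothetical scoring methods
method_scores = {
    "method_a": {"amount": 0.9, "step": 0.1, "age_mapped": 0.4, "gender_F": 0.2},
    "method_b": {"amount": 0.8, "step": 0.3, "age_mapped": 0.7, "gender_F": 0.1},
    "method_c": {"amount": 0.6, "step": 0.5, "age_mapped": 0.2, "gender_F": 0.4},
}

selected = vote_select(method_scores, k=2, votes_needed=2)
print(selected)
```

Voting across methods makes the selection less sensitive to the quirks of any single scoring method, at the cost of running all of them.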


### Perform feature selection

In [8]:
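# Wrap the local feature_selection.py file as an MLRun job function
myfn = mlrun.code_to_function('feature-selection_v2', kind='job',
                              filename='feature_selection.py', image='mlrun/mlrun',
                              description="")

# Run feature selection on a 25% sample (sample_ratio) with k=5 and
# save the result as a new feature vector with a "-short" suffix
feature_selection_run = myfn.run(
    params={'k': 5,
            'sample_ratio': 0.25,
            'output_vector_name': feature_vector_name + "-short",
            'ignore_type_errors': True},
    inputs={'df_artifact': fv.uri},
    name='feature_extraction',
    handler='feature_selection',
    local=True)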

> 2021-08-11 06:21:47,329 [info] starting run feature_extraction uid=e1a89257252d45aaa29d287bfd5d46b4 DB=http://mlrun-api:8080


Pass k=5 as keyword args. From version 0.25 passing these as positional arguments will result in an error


> 2021-08-11 06:21:52,879 [info] Couldn't calculate chi2 because of: Input X must be non-negative.


Liblinear failed to converge, increase the number of iterations.
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


> 2021-08-11 06:21:56,522 [info] votes needed to be selected: 3
> 2021-08-11 06:21:57,223 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-gilads/FeatureStore/transactions-fraud-short/parquet/vectors/transactions-fraud-short-latest.parquet', 'status': 'ready', 'updated': '2021-08-11T06:21:57.223848+00:00', 'size': 670030}


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-gilads,...fd5d46b4,0,Aug 11 06:21:49,completed,feature_extraction,v3io_user=gilads; kind=; owner=gilads; host=gilad-jupyter-744df8769-w5pqc,df_artifact,k=5; sample_ratio=0.25; output_vector_name=transactions-fraud-short; ignore_type_errors=True,top_features_vector=store://feature-vectors/fraud-demo-gilads/transactions-fraud-short,f_classif; mutual_info_classif; f_regression; LinearSVC; LogisticRegression; ExtraTreesClassifier; feature_scores; max_scaled_scores_feature_scores; selected_features_count; selected_features


to track results use .show() or .logs() or in CLI: 
!mlrun get run e1a89257252d45aaa29d287bfd5d46b4 --project fraud-demo-gilads , !mlrun logs e1a89257252d45aaa29d287bfd5d46b4 --project fraud-demo-gilads
> 2021-08-11 06:21:57,422 [info] run executed, status=completed


In [9]:
top_features_df = mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df()
top_features_df.tail()

Unnamed: 0,amount_max_2h,amount_sum_2h,amount_count_2h,amount_avg_2h,amount_max_12h,label
49996,27.3,27.3,1.0,27.3,127.82,0
49997,7.89,7.89,1.0,7.89,7.89,0
49998,34.04,34.04,1.0,34.04,34.04,0
49999,52.6,88.88,2.0,44.44,52.6,0
50000,12.81,12.81,1.0,12.81,56.85,0


## Train ensemble of models with top features

In [10]:
# Defining our training task, including our feature vector, label and hyperparams definitions
task = mlrun.new_task('training', 
                      inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
                      params={'label_column': 'label'}
                     )
task.with_hyper_params(model_list, strategy='list', selector='max.accuracy')

run = fn.run(task)

> 2021-08-11 06:21:57,507 [info] starting run training uid=208d1168179b4f8e987500795d56be06 DB=http://mlrun-api:8080
> 2021-08-11 06:21:57,680 [info] Job is running in the background, pod: training-wmnqr
> 2021-08-11 06:22:14,784 [info] best iteration=3, used criteria max.accuracy
> 2021-08-11 06:22:15,624 [info] run executed, status=completed
final state: completed


project,uid,iter,start,state,name,labels,inputs,parameters,results,artifacts
fraud-demo-gilads,...5d56be06,0,Aug 11 06:22:02,completed,training,v3io_user=gilads; kind=job; owner=gilads,dataset,label_column=label,best_iteration=3; accuracy=0.9924242424242424; test-error=0.007575757575757576; rocauc=0.9052171136653895; brier_score=0.19652144874835656; f1-score=0.6; precision_score=0.75; recall_score=0.5,test_set; probability-calibration; confusion-matrix; feature-importances; precision-recall-binary; roc-binary; model; iteration_results


to track results use .show() or .logs() or in CLI: 
!mlrun get run 208d1168179b4f8e987500795d56be06 --project fraud-demo-gilads , !mlrun logs 208d1168179b4f8e987500795d56be06 --project fraud-demo-gilads
> 2021-08-11 06:22:17,153 [info] run executed, status=completed


In [11]:
run.outputs

{'best_iteration': 3,
 'accuracy': 0.9924242424242424,
 'test-error': 0.007575757575757576,
 'rocauc': 0.9052171136653895,
 'brier_score': 0.19652144874835656,
 'f1-score': 0.6,
 'precision_score': 0.75,
 'recall_score': 0.5,
 'test_set': 'store://artifacts/fraud-demo-gilads/training_test_set:208d1168179b4f8e987500795d56be06',
 'probability-calibration': 'v3io:///projects/fraud-demo-gilads/artifacts/model/plots/3/probability-calibration.html',
 'confusion-matrix': 'v3io:///projects/fraud-demo-gilads/artifacts/model/plots/3/confusion-matrix.html',
 'feature-importances': 'v3io:///projects/fraud-demo-gilads/artifacts/model/plots/3/feature-importances.html',
 'precision-recall-binary': 'v3io:///projects/fraud-demo-gilads/artifacts/model/plots/3/precision-recall-binary.html',
 'roc-binary': 'v3io:///projects/fraud-demo-gilads/artifacts/model/plots/3/roc-binary.html',
 'model': 'store://artifacts/fraud-demo-gilads/training_model:208d1168179b4f8e987500795d56be06',
 'iteration_results': 'v3io
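
The reported scores can be cross-checked with a few lines of arithmetic: F1 is the harmonic mean of precision and recall, and `test-error` appears to be one minus `accuracy`. The values below are copied from the `run.outputs` dictionary above; the snippet is only a consistency check, not part of the training flow.

```python
import math

# Values copied from the run.outputs dictionary above
outputs = {
    "accuracy": 0.9924242424242424,
    "test-error": 0.007575757575757576,
    "f1-score": 0.6,
    "precision_score": 0.75,
    "recall_score": 0.5,
}

precision = outputs["precision_score"]
recall = outputs["recall_score"]

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Error rate computed as the complement of accuracy
error = 1 - outputs["accuracy"]

print(math.isclose(f1, outputs["f1-score"]))
print(math.isclose(error, outputs["test-error"]))
```

The check also highlights that an accuracy above 0.99 sits alongside a recall of 0.5, so accuracy alone says little about how many fraudulent transactions the model catches.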
