# Model Training and Validation Pipeline
Now that you have created features, you can use them to train one or more models. In this section, you will generate feature vectors with multiple features from one or more feature sets and feed them into an automated ML training and testing pipeline to create high-quality models.

The ML pipeline can be triggered and tracked manually during interactive development, or it can be saved (into Git) and executed automatically on a given schedule or in response to different events (such as code modification, CI/CD, data changes, model drift, and so on). See [MLRun project and CI/CD documentation](https://docs.mlrun.org/en/stable/projects/project.html) for details.


### Saving and loading projects from Git

After you save your project and its elements (functions, workflows, artifacts, etc.), you can commit all your changes to a 
Git repository. You can do this with standard Git tools or with MLRun `project` methods such as `pull`, `push`, and 
`remote`, which call the Git API for you.

You can then load a project from Git with the MLRun `load_project` method, for example:

    project = mlrun.load_project("./myproj", "git://github.com/mlrun/project-demo.git", name=project_name)
    
or using the MLRun CLI:

    mlrun project -n myproj -u "git://github.com/mlrun/project-demo.git" ./myproj
    
You can also load a project, or create it if it does not exist yet, with the MLRun `get_or_create_project` method.
    
Read [CI/CD integration](../../projects/ci-integration.html) for more details.

In [1]:
import mlrun
project = mlrun.get_or_create_project(
    name="fraud-demo",
    context="./",
    user_project=True,
    )

> 2025-01-06 11:44:28,656 [info] Project loaded successfully: {"project_name":"fraud-demo-jovyan"}


## Creating and Evaluating a Feature Vector

Models are trained with multiple features, which can arrive from different feature sets and be collected into training (feature) vectors. Feature stores know how to correctly combine the features into a vector by implementing smart JOINs and assessing the time dimension (time traveling).
To define a feature vector, you need to specify a name, the list of features it contains, the target features (labels), and other optional parameters. Features are specified as `<FeatureSet>.<Feature>` or `<FeatureSet>.*` (all the features in a feature set). The following part demonstrates how to create and use a feature vector.
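To make the "time traveling" join more concrete, here is a minimal, standard-library sketch of a point-in-time (as-of) join. It is not the feature store's implementation. The entity key `source`, the sample rows, and the helper name `as_of_join` are hypothetical and chosen only for illustration. For each label row, the sketch picks the latest feature row of the same entity whose timestamp is not later than the label's timestamp, so no future information leaks into the training vector.

```python
from bisect import bisect_right
from datetime import datetime


def as_of_join(label_rows, feature_rows):
    """Attach to each label row the latest feature row (same key) at or before its timestamp."""
    # Index feature rows per entity key, sorted by time.
    by_key = {}
    for row in sorted(feature_rows, key=lambda r: r["timestamp"]):
        by_key.setdefault(row["source"], []).append(row)

    joined = []
    for label in label_rows:
        history = by_key.get(label["source"], [])
        times = [r["timestamp"] for r in history]
        # Position of the first feature row that is strictly later than the label.
        pos = bisect_right(times, label["timestamp"])
        match = history[pos - 1] if pos else None
        joined.append({
            "source": label["source"],
            "timestamp": label["timestamp"],
            "label": label["label"],
            "amount_sum_2h": match["amount_sum_2h"] if match else None,
        })
    return joined


# Hypothetical sample data (not taken from the demo dataset).
feature_rows = [
    {"source": "C1", "timestamp": datetime(2025, 1, 6, 10, 0), "amount_sum_2h": 20.0},
    {"source": "C1", "timestamp": datetime(2025, 1, 6, 12, 0), "amount_sum_2h": 75.0},
    {"source": "C2", "timestamp": datetime(2025, 1, 6, 11, 0), "amount_sum_2h": 5.0},
]
label_rows = [
    {"source": "C1", "timestamp": datetime(2025, 1, 6, 11, 30), "label": 0},
    {"source": "C2", "timestamp": datetime(2025, 1, 6, 10, 30), "label": 1},
]

rows = as_of_join(label_rows, feature_rows)
for row in rows:
    print(row["source"], row["label"], row["amount_sum_2h"])
```

In this sketch the `C1` label at 11:30 receives the 10:00 feature value rather than the later 12:00 one, and the `C2` label gets no value because its only feature row is newer than the label.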


### Create a feature vector

In [2]:
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the list of features to use
features = ['events.*',
            'transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_sum_14d', 
            'transactions.es_health_sum_14d',
            'transactions.es_otherservices_sum_14d', 
            'transactions.es_food_sum_14d',
            'transactions.es_hotelservices_sum_14d', 
            'transactions.es_barsandrestaurants_sum_14d',
            'transactions.es_tech_sum_14d', 
            'transactions.es_sportsandtoys_sum_14d',
            'transactions.es_wellnessandbeauty_sum_14d', 
            'transactions.es_hyper_sum_14d',
            'transactions.es_fashion_sum_14d', 
            'transactions.es_home_sum_14d', 
            'transactions.es_travel_sum_14d', 
            'transactions.es_leisure_sum_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week']
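As a small illustration of the `<FeatureSet>.<Feature>` and `<FeatureSet>.*` reference forms, the following sketch groups references by feature set and flags wildcard entries. The helper `group_feature_refs` is hypothetical and handles only the two forms described in this section. It is not part of MLRun.

```python
from collections import defaultdict


def group_feature_refs(refs):
    """Group '<FeatureSet>.<Feature>' references by feature set name."""
    grouped = defaultdict(list)
    for ref in refs:
        feature_set, sep, feature = ref.partition(".")
        if not sep or not feature_set or not feature:
            raise ValueError(f"expected '<FeatureSet>.<Feature>', got {ref!r}")
        grouped[feature_set].append(feature)
    return dict(grouped)


# A short sample in the same style as the list above.
sample_refs = [
    "events.*",
    "transactions.amount_max_2h",
    "transactions.amount",
    "labels.label",
]

grouped = group_feature_refs(sample_refs)
# Feature sets requested in full through the '*' wildcard.
wildcards = [name for name, feats in grouped.items() if "*" in feats]
print(grouped)
print(wildcards)
```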

In [3]:
# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the feature store
transactions_fv.save()

Once you have defined the feature vector, you can use `get_offline_features()` to generate the vector dataset and return it as a dataframe or materialize it into a file (CSV or Parquet). In this tutorial, the first step of the training pipeline described below materializes the vector for you.
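If you want to inspect the vector interactively, the sketch below shows one possible way to do so. It assumes MLRun 1.6 or later, where the run log in this tutorial recommends `get_feature_vector('store://...').get_offline_features()` over the module-level function. It also assumes the `store://feature-vectors/<project>/<name>` URI form that appears in the run output. The helper names `feature_vector_uri` and `load_vector_dataframe` are hypothetical.

```python
def feature_vector_uri(project_name, vector_name):
    """Build a feature-vector store URI in the form seen in this tutorial's run output."""
    return f"store://feature-vectors/{project_name}/{vector_name}"


def load_vector_dataframe(project_name, vector_name):
    """Load the offline feature vector into a pandas DataFrame (needs a working MLRun setup)."""
    # Imported lazily so the URI helper above can be used without MLRun installed.
    import mlrun.feature_store as fstore

    vector = fstore.get_feature_vector(feature_vector_uri(project_name, vector_name))
    # get_offline_features() returns a response object; to_dataframe() loads the data.
    return vector.get_offline_features().to_dataframe()


uri = feature_vector_uri("fraud-demo-jovyan", "transactions-fraud")
print(uri)
```

Calling `load_vector_dataframe("fraud-demo-jovyan", "transactions-fraud")` in the notebook should then return the joined dataset, provided the feature sets from the earlier parts of the tutorial have been ingested.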

## Building and Running an Automated Training and Validation Pipeline

MLRun lets you build distributed ML pipelines that can handle data processing, automated feature selection, training, optimization, testing, deployment, and so on. Pipelines are composed of steps that run or deploy custom or library (from the MLRun hub) serverless functions. Pipelines can be run locally (for debugging or small-scale tasks), on a scalable Kubernetes cluster (using Kubeflow), or in a CI/CD system.

The example consists of the following pipeline steps (most of them using pre-defined MLRun hub functions):

1. Materialize a feature vector (using `src/get_vector`).
2. Select the most relevant features (using `hub://feature_selection`).
3. Train the model with multiple algorithms (using `hub://auto_trainer`).
4. Evaluate the model (using `hub://auto_trainer`).
5. Deploy the model and its application to the test cluster (using `hub://v2_model_server`). The next section will explain the model and application pipeline in detail.

Each step can accept the previous steps’ results or data, and generate results, multiple visual artifacts/charts, versioned data objects, and registered models.
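The feature-selection step in this tutorial runs with the parameters `k=5` and `min_votes=2`, and logs "votes needed to be selected: 2". One plausible reading of these parameters is a voting scheme: each scoring method votes for its top `k` features, and a feature is kept when it collects at least `min_votes` votes. The sketch below illustrates that idea under this assumption. It is not the hub function's code, and the method names and scores are made up.

```python
from collections import Counter


def select_by_votes(scores_by_method, k, min_votes):
    """Keep features that appear in the top-k of at least `min_votes` scoring methods."""
    votes = Counter()
    for scores in scores_by_method.values():
        # Each method votes for its k highest-scoring features.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        votes.update(top_k)
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical scores from three hypothetical scoring methods.
scores_by_method = {
    "method_a": {"amount": 0.9, "amount_sum_2h": 0.7, "gender_F": 0.1, "step": 0.3},
    "method_b": {"amount": 0.8, "amount_sum_2h": 0.2, "gender_F": 0.1, "step": 0.6},
    "method_c": {"amount": 0.5, "amount_sum_2h": 0.6, "gender_F": 0.4, "step": 0.1},
}

selected = select_by_votes(scores_by_method, k=2, min_votes=2)
print(selected)
```

Here `amount` receives three votes and `amount_sum_2h` two, so both pass the threshold, while `step` receives a single vote and is dropped.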

We have defined the workflow in [`src/new_train_workflow.py`](./src/new_train_workflow.py). 

## Running the ML pipeline

The workflow/pipeline can be executed using the MLRun SDK (`project.run()` method) or using CLI commands (`mlrun project`), and can run directly from the source repo (Git). See details in [MLRun Projects and Automation documentation](https://docs.mlrun.org/en/stable/projects/project.html).

You can set arguments and destinations for the different artifacts when you run the workflow. The pipeline progress and results are shown in the notebook. Alternatively, you can check the progress, logs, artifacts, and more, in the MLRun UI or the CI/CD system. The next part demonstrates how to run the pipeline with custom arguments using the SDK.

In [4]:
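# Run the 'main' workflow with custom arguments.
# dirty=True allows running with uncommitted Git changes;
# watch=True waits for the pipeline to complete and shows its progress.
run_id = project.run(
    'main',
    arguments={
        'vector_name': "transactions-fraud",
        'features': features,
        'label_column': "labels.label",
    },
    dirty=True,
    watch=True,
)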



> 2025-01-06 11:44:33,195 [info] Storing function: {"db":null,"name":"get-vector","uid":"f077d76300b446218de2a8b1bd260c6f"}
> 2025-01-06 11:44:34,052 [info] Merger detected timestamp resolution incompatibility between feature set labels and others: datetime64[us] and datetime64[ms]. Converting feature set timestamp column 'timestamp' to type datetime64[us].
> 2025-01-06 11:44:34,135 [info] wrote target: {'partitioned': True, 'size': 151159, 'kind': 'parquet', 'status': 'ready', 'updated': '2025-01-06T11:44:34.135775+00:00', 'name': 'parquet', 'path': 's3://mlrun/projects/fraud-demo-jovyan/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet'}


| project | uid | iter | start | state | kind | name | labels | inputs | parameters | results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud-demo-jovyan | ...bd260c6f | 0 | Jan 06 11:44:33 | completed | run | get-vector | workflow=4133f7a9627d4b048e7feddbdea5f3e3; kind=local; owner=jovyan; host=mlrun-jupyter-764bfd486c-rnknv | | feature_vector=transactions-fraud; features=['events.*', 'transactions.amount_max_2h', 'transactions.amount_sum_2h', 'transactions.amount_count_2h', 'transactions.amount_avg_2h', 'transactions.amount_max_12h', 'transactions.amount_sum_12h', 'transactions.amount_count_12h', 'transactions.amount_avg_12h', 'transactions.amount_max_24h', 'transactions.amount_sum_24h', 'transactions.amount_count_24h', 'transactions.amount_avg_24h', 'transactions.es_transportation_sum_14d', 'transactions.es_health_sum_14d', 'transactions.es_otherservices_sum_14d', 'transactions.es_food_sum_14d', 'transactions.es_hotelservices_sum_14d', 'transactions.es_barsandrestaurants_sum_14d', 'transactions.es_tech_sum_14d', 'transactions.es_sportsandtoys_sum_14d', 'transactions.es_wellnessandbeauty_sum_14d', 'transactions.es_hyper_sum_14d', 'transactions.es_fashion_sum_14d', 'transactions.es_home_sum_14d', 'transactions.es_travel_sum_14d', 'transactions.es_leisure_sum_14d', 'transactions.gender_F', 'transactions.gender_M', 'transactions.step', 'transactions.amount', 'transactions.timestamp_hour', 'transactions.timestamp_day_of_week']; label_feature=labels.label; target={'name': 'parquet', 'kind': 'parquet'}; update_stats=True | return= |





> 2025-01-06 11:44:34,217 [info] Run execution finished: {"name":"get-vector","status":"completed"}
> 2025-01-06 11:44:34,218 [info] Storing function: {"db":null,"name":"feature-selection","uid":"9f000c66a89a409f8efb824b3e348b9b"}
> 2025-01-06 11:44:36,623 [info] votes needed to be selected: 2



Call to deprecated function (or staticmethod) get_offline_features. (get_offline_features() will be removed in 1.8.0, please instead use get_feature_vector('store://feature_vector_name').get_offline_features()) -- Deprecated since version 1.6.0.



> 2025-01-06 11:44:36,984 [info] Merger detected timestamp resolution incompatibility between feature set labels and others: datetime64[us] and datetime64[ms]. Converting feature set timestamp column 'timestamp' to type datetime64[us].
> 2025-01-06 11:44:37,006 [info] wrote target: {'partitioned': True, 'size': 96017, 'kind': 'parquet', 'status': 'ready', 'updated': '2025-01-06T11:44:37.006733+00:00', 'name': 'parquet', 'path': 's3://mlrun/projects/fraud-demo-jovyan/FeatureStore/short/parquet/vectors/short-latest.parquet'}


| project | uid | iter | start | state | kind | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud-demo-jovyan | ...3e348b9b | 0 | Jan 06 11:44:34 | completed | run | feature-selection | workflow=4133f7a9627d4b048e7feddbdea5f3e3; kind=local; owner=jovyan; host=mlrun-jupyter-764bfd486c-rnknv | df_artifact | output_vector_name=short; label_column=label; k=5; min_votes=2; ignore_type_errors=True | top_features_vector=store://feature-vectors/fraud-demo-jovyan/short | f_classif; mutual_info_classif; chi2; f_regression; LinearSVC; LogisticRegression; ExtraTreesClassifier; feature_scores; max_scaled_scores_feature_scores; selected_features_count; selected_features |





> 2025-01-06 11:44:37,251 [info] Run execution finished: {"name":"feature-selection","status":"completed"}
> 2025-01-06 11:44:37,252 [info] Storing function: {"db":null,"name":"train","uid":"1ff442f179ab4292825e3137c04b1b3b"}
> 2025-01-06 11:44:37,325 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2025-01-06 11:44:37,325 [info] label columns: label
> 2025-01-06 11:44:37,456 [info] Merger detected timestamp resolution incompatibility between feature set labels and others: datetime64[us] and datetime64[ms]. Converting feature set timestamp column 'timestamp' to type datetime64[us].
> 2025-01-06 11:44:37,461 [info] Sample set not given, using the whole training set as the sample set
> 2025-01-06 11:44:37,556 [info] training 'transaction_fraud_rf'
> 2025-01-06 11:44:38,802 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2025-01-06 11:44:38,803 [info] label columns: label
> 2025-01-06 11:44:




Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.



> 2025-01-06 11:44:40,905 [info] best iteration=1, used criteria max.accuracy


| project | uid | iter | start | state | kind | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud-demo-jovyan | ...c04b1b3b | 0 | Jan 06 11:44:37 | completed | run | train | workflow=4133f7a9627d4b048e7feddbdea5f3e3; kind=local; owner=jovyan | dataset | sample=-1; label_column=label; test_size=0.1 | best_iteration=1; accuracy=0.9925; f1_score=0.11764705882352941; precision_score=0.2; recall_score=0.08333333333333333 | feature-importance; test_set; confusion-matrix; roc-curves; calibration-curve; model; iteration_results; parallel_coordinates |





> 2025-01-06 11:44:41,523 [info] Run execution finished: {"name":"train","status":"completed"}
> 2025-01-06 11:44:41,525 [info] Storing function: {"db":null,"name":"evaluate","uid":"d401187210154d11b5ec39e97fce364c"}
> 2025-01-06 11:44:41,630 [info] not all of the columns to drop in the dataset, drop columns process skipped
> 2025-01-06 11:44:41,655 [info] evaluating 'model_LinearRegression'


| project | uid | iter | start | state | kind | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fraud-demo-jovyan | ...7fce364c | 0 | Jan 06 11:44:41 | completed | run | evaluate | workflow=4133f7a9627d4b048e7feddbdea5f3e3; kind=local; owner=jovyan; host=mlrun-jupyter-764bfd486c-rnknv | dataset | label_columns=label; model=store://models/fraud-demo-jovyan/transaction_fraud_rf:latest@4133f7a9627d4b048e7feddbdea5f3e3; drop_columns=label | evaluation_accuracy=0.9925; evaluation_f1_score=0.11764705882352941; evaluation_precision_score=0.2; evaluation_recall_score=0.08333333333333333 | evaluation-test_set; evaluation-roc-curves; evaluation-calibration-curve; evaluation-confusion-matrix |





> 2025-01-06 11:44:42,321 [info] Run execution finished: {"name":"evaluate","status":"completed"}
> 2025-01-06 11:44:42,337 [info] Starting remote function deploy
2025-01-06 11:44:42  (info) Deploying function
2025-01-06 11:44:42  (info) Building
2025-01-06 11:44:42  (info) Staging files and preparing base images
2025-01-06 11:44:42  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-01-06 11:44:42  (info) Building processor image
2025-01-06 11:48:17  (info) Build complete
2025-01-06 11:48:31  (info) Function deploy complete
> 2025-01-06 11:48:34,791 [info] Successfully deployed function: {"external_invocation_urls":["localhost:32061"],"internal_invocation_urls":["nuclio-fraud-demo-jovyan-serving.mlrun.svc.cluster.local:8080"]}


| uid | start | state | kind | name | parameters | results |
| --- | --- | --- | --- | --- | --- | --- |
| ...bd260c6f | Jan 06 11:44:33 | completed | | get-vector | feature_vector=transactions-fraud; features=['events.*', 'transactions.amount_max_2h', 'transactions.amount_sum_2h', 'transactions.amount_count_2h', 'transactions.amount_avg_2h', 'transactions.amount_max_12h', 'transactions.amount_sum_12h', 'transactions.amount_count_12h', 'transactions.amount_avg_12h', 'transactions.amount_max_24h', 'transactions.amount_sum_24h', 'transactions.amount_count_24h', 'transactions.amount_avg_24h', 'transactions.es_transportation_sum_14d', 'transactions.es_health_sum_14d', 'transactions.es_otherservices_sum_14d', 'transactions.es_food_sum_14d', 'transactions.es_hotelservices_sum_14d', 'transactions.es_barsandrestaurants_sum_14d', 'transactions.es_tech_sum_14d', 'transactions.es_sportsandtoys_sum_14d', 'transactions.es_wellnessandbeauty_sum_14d', 'transactions.es_hyper_sum_14d', 'transactions.es_fashion_sum_14d', 'transactions.es_home_sum_14d', 'transactions.es_travel_sum_14d', 'transactions.es_leisure_sum_14d', 'transactions.gender_F', 'transactions.gender_M', 'transactions.step', 'transactions.amount', 'transactions.timestamp_hour', 'transactions.timestamp_day_of_week']; label_feature=labels.label; target={'name': 'parquet', 'kind': 'parquet'}; update_stats=True | return= |
| ...3e348b9b | Jan 06 11:44:34 | completed | | feature-selection | output_vector_name=short; label_column=label; k=5; min_votes=2; ignore_type_errors=True | top_features_vector=store://feature-vectors/fraud-demo-jovyan/short |
| ...c04b1b3b | Jan 06 11:44:37 | completed | | train | sample=-1; label_column=label; test_size=0.1 | best_iteration=1; accuracy=0.9925; f1_score=0.11764705882352941; precision_score=0.2; recall_score=0.08333333333333333 |
| ...7fce364c | Jan 06 11:44:41 | completed | | evaluate | label_columns=label; model=store://models/fraud-demo-jovyan/transaction_fraud_rf:latest@4133f7a9627d4b048e7feddbdea5f3e3; drop_columns=label | evaluation_accuracy=0.9925; evaluation_f1_score=0.11764705882352941; evaluation_precision_score=0.2; evaluation_recall_score=0.08333333333333333 |


> 2025-01-06 11:48:34,843 [info] Started run workflow fraud-demo-jovyan-main with run id = '4133f7a9627d4b048e7feddbdea5f3e3' by local engine


| uid | start | state | kind | name | parameters | results |
| --- | --- | --- | --- | --- | --- | --- |
| ...7fce364c | Jan 06 11:44:41 | completed | run | evaluate | label_columns=label; model=store://models/fraud-demo-jovyan/transaction_fraud_rf:latest@4133f7a9627d4b048e7feddbdea5f3e3; drop_columns=label | evaluation_accuracy=0.9925; evaluation_f1_score=0.11764705882352941; evaluation_precision_score=0.2; evaluation_recall_score=0.08333333333333333 |
| ...c04b1b3b | Jan 06 11:44:37 | completed | run | train | sample=-1; label_column=label; test_size=0.1 | best_iteration=1; accuracy=0.9925; f1_score=0.11764705882352941; precision_score=0.2; recall_score=0.08333333333333333 |
| ...3e348b9b | Jan 06 11:44:34 | completed | run | feature-selection | output_vector_name=short; label_column=label; k=5; min_votes=2; ignore_type_errors=True | top_features_vector=store://feature-vectors/fraud-demo-jovyan/short |
| ...bd260c6f | Jan 06 11:44:33 | completed | run | get-vector | feature_vector=transactions-fraud; features=['events.*', 'transactions.amount_max_2h', 'transactions.amount_sum_2h', 'transactions.amount_count_2h', 'transactions.amount_avg_2h', 'transactions.amount_max_12h', 'transactions.amount_sum_12h', 'transactions.amount_count_12h', 'transactions.amount_avg_12h', 'transactions.amount_max_24h', 'transactions.amount_sum_24h', 'transactions.amount_count_24h', 'transactions.amount_avg_24h', 'transactions.es_transportation_sum_14d', 'transactions.es_health_sum_14d', 'transactions.es_otherservices_sum_14d', 'transactions.es_food_sum_14d', 'transactions.es_hotelservices_sum_14d', 'transactions.es_barsandrestaurants_sum_14d', 'transactions.es_tech_sum_14d', 'transactions.es_sportsandtoys_sum_14d', 'transactions.es_wellnessandbeauty_sum_14d', 'transactions.es_hyper_sum_14d', 'transactions.es_fashion_sum_14d', 'transactions.es_home_sum_14d', 'transactions.es_travel_sum_14d', 'transactions.es_leisure_sum_14d', 'transactions.gender_F', 'transactions.gender_M', 'transactions.step', 'transactions.amount', 'transactions.timestamp_hour', 'transactions.timestamp_day_of_week']; label_feature=labels.label; target={'name': 'parquet', 'kind': 'parquet'}; update_stats=True | return= |
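The train and evaluate runs above report an accuracy of 0.9925 together with a precision of 0.2, a recall of about 0.083, and an F1 score of about 0.118. The sketch below shows how such a combination can arise on an imbalanced dataset. The confusion-matrix counts are hypothetical: they are one set of integers that reproduces the reported values, not numbers taken from the run's artifacts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 12 fraudulent samples out of 2000, only 1 of them detected.
metrics = classification_metrics(tp=1, fp=4, fn=11, tn=1984)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With counts like these, almost all of the accuracy comes from correctly classified legitimate transactions, so precision, recall, and F1 are usually the more informative metrics for a fraud model.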


## Test the model endpoint


Now that your model is deployed using the pipeline, you can invoke it as usual:

In [5]:
# Define your serving function
serving_fn = project.get_function('serving')

# Choose an id for your test
sample_id = 'C1000148617'
model_inference_path = '/v2/models/fraud/infer'

# Send our sample ID for prediction
serving_fn.invoke(path=model_inference_path,
                  body={'inputs': [[sample_id]]})

> 2025-01-06 11:48:34,915 [info] Invoking function: {"method":"POST","path":"http://nuclio-fraud-demo-jovyan-serving.mlrun.svc.cluster.local:8080/v2/models/fraud/infer"}


{'id': '042e9124-32ba-4cb0-8471-61e4c2c8d9b5',
 'model_name': 'fraud',
 'outputs': [0],
 'timestamp': '2025-01-06 11:48:34.976544+00:00',
 'model_version': 'latest'}
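The request and response shapes used above can be wrapped in two small helpers, for example to score several entity IDs in one call. This is a sketch only. The helper names are hypothetical, and the mapping of output `1` to "fraud" and `0` to "not fraud" is an assumption about how the label is encoded.

```python
import json


def build_infer_request(entity_ids):
    """Build a request body with one input row per entity ID, as in the cell above."""
    return {"inputs": [[entity_id] for entity_id in entity_ids]}


def summarize_response(entity_ids, response):
    """Pair each entity ID with a readable prediction."""
    outputs = response["outputs"]
    if len(outputs) != len(entity_ids):
        raise ValueError("expected one output per input row")
    # Assumption: label 1 marks a fraudulent transaction, 0 a legitimate one.
    return {
        entity_id: ("fraud" if output == 1 else "not fraud")
        for entity_id, output in zip(entity_ids, outputs)
    }


ids = ["C1000148617"]
body = build_infer_request(ids)
# A response with the same fields as the output shown above.
response = {"model_name": "fraud", "outputs": [0], "model_version": "latest"}
summary = summarize_response(ids, response)
print(json.dumps(body))
print(summary)
```

In the notebook, `body` could be passed as the `body` argument of `serving_fn.invoke(...)`, and the returned dictionary could be passed to `summarize_response`.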

## Done!

You've completed Part 4: model training with the feature store.
Proceed to [Part 5](06-real-time-serving-pipeline.ipynb) to learn how to deploy real-time application pipelines.