# Model Training and Validation Pipeline
Now that you have created features, you can use them to train one or more models. In this section, you will generate feature vectors with multiple features from one or more feature sets and feed them into an automated ML training and testing pipeline to create high-quality models.

The ML pipeline can be triggered and tracked manually during interactive development, or it can be saved (into Git) and executed automatically on a given schedule or in reaction to different events (such as code modification, CI/CD, data changes, model drift, and so on). See the MLRun project and CI/CD documentation for details.


In [1]:
project_name = 'fraud-demo'

In [2]:
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)

> 2023-08-03 21:35:22,585 [info] Project loaded successfully: {'project_name': 'fraud-demo'}


In [3]:
# set project level parameters and save
project.spec.params = {'label_column': 'label'}
project.save()

<mlrun.projects.project.MlrunProject at 0x7f573ab82f10>

In [4]:
print(project.to_yaml())

kind: project
metadata:
  name: fraud-demo-pengw
  created: '2023-07-27T02:07:46.064000'
spec:
  params:
    label_column: label
  functions: []
  workflows: []
  artifacts: []
  conda: ''
  source: git@github.com:pengwei715/demo-fraud.git#refs/heads/feature/align_with_book
  desired_state: online
  owner: pengw
  build:
    commands: []
    requirements: []
  custom_packagers: []
status:
  state: online



### Saving and loading projects from Git

After you save your project and its elements (functions, workflows, artifacts, etc.), you can commit all your changes to a 
Git repository. This can be done using standard Git tools or using MLRun `project` methods such as `pull`, `push`, and 
`remote`, which call the Git API for you.

Projects can then be loaded from Git using the MLRun `load_project` method, for example: 

    project = mlrun.load_project("./myproj", "git://github.com/mlrun/project-demo.git", name=project_name)
    
or using MLRun CLI:

    mlrun project -n myproj -u "git://github.com/mlrun/project-demo.git" ./myproj
    
Read [CI/CD integration](../../projects/ci-integration.html) for more details.

## Creating and Evaluating a Feature Vector

Models are trained with multiple features, which can come from different feature sets and are collected into training (feature) vectors. Feature stores know how to correctly combine the features into a vector by implementing smart JOINs that take the time dimension into account (time traveling).
To define a feature vector, you need to specify a name, the list of features it contains, the target features (labels), and other optional parameters. Features are specified as `<FeatureSet>.<Feature>` or `<FeatureSet>.*` (all the features in a feature set). The following part demonstrates how to create and use a feature vector.
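
To make the "time traveling" JOIN concrete, here is a minimal, standard-library sketch of a point-in-time (as-of) join. It is an illustration of the idea only, not MLRun's implementation; the function name `as_of_join` and the sample rows are hypothetical.

```python
from bisect import bisect_right


def as_of_join(events, feature_rows):
    """Attach to each (key, ts) event the latest feature value for that key
    whose timestamp is <= ts, so no future data leaks into the vector."""
    # Index feature rows per key, ordered by timestamp
    by_key = {}
    for key, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for key, ts in events:
        times, values = by_key.get(key, ([], []))
        idx = bisect_right(times, ts) - 1  # last row at or before ts
        joined.append((key, ts, values[idx] if idx >= 0 else None))
    return joined


# Hypothetical data: (entity key, timestamp, feature value)
feature_rows = [("C1", 10, 5.0), ("C1", 20, 7.5), ("C2", 15, 1.0)]
# Hypothetical label events: (entity key, timestamp)
events = [("C1", 12), ("C1", 25), ("C2", 5)]

result = as_of_join(events, feature_rows)
print(result)
```

An event that precedes every feature row for its key (here `("C2", 5)`) gets `None`, which is the kind of gap an imputation policy would later need to fill.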


### Create a feature vector

In [5]:
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the list of features to use
features = ['events.*',
            'transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_sum_14d', 
            'transactions.es_health_sum_14d',
            'transactions.es_otherservices_sum_14d', 
            'transactions.es_food_sum_14d',
            'transactions.es_hotelservices_sum_14d', 
            'transactions.es_barsandrestaurants_sum_14d',
            'transactions.es_tech_sum_14d', 
            'transactions.es_sportsandtoys_sum_14d',
            'transactions.es_wellnessandbeauty_sum_14d', 
            'transactions.es_hyper_sum_14d',
            'transactions.es_fashion_sum_14d', 
            'transactions.es_home_sum_14d', 
            'transactions.es_travel_sum_14d', 
            'transactions.es_leisure_sum_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week']
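
As a quick sanity check on a feature list like the one above, the following sketch groups `<FeatureSet>.<Feature>` references by feature set. The helper `group_by_feature_set` is hypothetical (it is not part of MLRun), and a short sample list is used so the snippet runs on its own.

```python
from collections import defaultdict


def group_by_feature_set(feature_refs):
    """Split '<FeatureSet>.<Feature>' references and group them by feature set."""
    grouped = defaultdict(list)
    for ref in feature_refs:
        feature_set, sep, feature = ref.partition(".")
        if not sep or not feature_set or not feature:
            raise ValueError(f"expected '<FeatureSet>.<Feature>', got {ref!r}")
        grouped[feature_set].append(feature)
    return dict(grouped)


# A short sample in the same format as the list above
sample = ["events.*", "transactions.amount_max_2h", "transactions.gender_F"]
grouped = group_by_feature_set(sample)
print(grouped)
```

A `*` entry stands for every feature in that feature set, so it is kept as-is rather than expanded.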

In [7]:
# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the feature store
transactions_fv.save()

## Preview the feature vector data

Obtain the values of the features in the feature vector, to ensure the data appears as expected.

In [8]:
# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())

> 2023-08-03 21:39:07,027 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-pengw/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-08-03T21:39:07.027688+00:00', 'size': 150966, 'partitioned': True}


In [9]:
# Preview your dataset
train_dataset.to_dataframe().head()

| | event_password_change | event_details_change | event_login | amount_max_2h | amount_sum_2h | amount_count_2h | amount_avg_2h | amount_max_12h | amount_sum_12h | amount_count_12h | ... | es_home_sum_14d | es_travel_sum_14d | es_leisure_sum_14d | gender_F | gender_M | step | amount | timestamp_hour | timestamp_day_of_week | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 74.89 | 74.89 | 1.0 | 74.89 | 74.89 | 74.89 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 74.89 | 21.0 | 1.0 | 0.0 |
| 1 | 0 | 0 | 1 | 1.83 | 1.83 | 1.0 | 1.83 | 1.83 | 1.83 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 72.0 | 1.83 | 21.0 | 1.0 | 0.0 |
| 2 | 0 | 0 | 1 | 18.72 | 40.22 | 3.0 | 13.406667 | 18.72 | 40.22 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 66.0 | 18.72 | 21.0 | 1.0 | 0.0 |
| 3 | 1 | 0 | 0 | 25.92 | 67.94 | 4.0 | 16.985 | 25.92 | 67.94 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 29.0 | 3.08 | 21.0 | 1.0 | 0.0 |
| 4 | 1 | 0 | 0 | 24.75 | 30.17 | 2.0 | 15.085 | 24.75 | 30.17 | 2.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 141.0 | 24.75 | 21.0 | 1.0 | 0.0 |


## Building and Running an Automated Training and Validation Pipeline

MLRun lets you build distributed ML pipelines that can handle data processing, automated feature selection, training, optimization, testing, deployments, and so on. Pipelines are composed of steps that run or deploy custom or library (from the MLRun hub) serverless functions. Pipelines can be run locally (for debugging or small-scale tasks), on a scalable Kubernetes cluster (using Kubeflow), or in a CI/CD system.

The example consists of the following pipeline steps (all using pre-defined MLRun hub functions):

1. Materialize a feature vector (using hub://get_offline_features).
2. Select the best features (using hub://feature_selection).
3. Train the model with multiple algorithms (using hub://auto_trainer).
4. Evaluate the model (using hub://auto_trainer).
5. Deploy the model and its application to the test cluster (using hub://v2_model_server). The next section will explain the model and application pipeline in detail.

Each step can accept the previous steps’ results or data, and generate results, multiple visual artifacts/charts, versioned data objects, and registered models.
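
The feature-selection step can be pictured as a vote among several scoring methods. The sketch below assumes that `k` means "take the top `k` features from each method" and that `min_votes` is the number of methods that must agree; treat both as assumptions and check the hub function's documentation for the exact semantics. The method names and scores are made up for illustration.

```python
from collections import Counter


def select_by_votes(scores_per_method, k, min_votes):
    """Each scoring method votes for its top-k features; keep the features
    that collect at least min_votes votes."""
    votes = Counter()
    for scores in scores_per_method.values():
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        votes.update(top_k)
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical scores from three hypothetical scoring methods
scores_per_method = {
    "method_a": {"amount": 0.9, "step": 0.2, "gender_F": 0.5},
    "method_b": {"amount": 0.7, "step": 0.6, "gender_F": 0.1},
    "method_c": {"amount": 0.3, "step": 0.1, "gender_F": 0.8},
}

selected = select_by_votes(scores_per_method, k=2, min_votes=2)
print(selected)
```

Here `amount` is in every method's top two and `gender_F` in two of them, so both are kept, while `step` gets a single vote and is dropped.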

In [11]:
import mlrun
from kfp import dsl
from mlrun.model import HyperParamOptions

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(
    name="Fraud Detection Pipeline",
    description="Detecting fraud from a transactions dataset",
)
def pipeline(vector_name="transactions-fraud", features=[], label_column="is_error"):
    project = mlrun.get_current_project()
    # Get FeatureVector
    get_vector = mlrun.run_function(
        "hub://get_offline_features",
        name="get_vector",
        params={
            "feature_vector": vector_name,
            "features": features,
            "label_feature": label_column,
            "entity_timestamp_column": "timestamp",
            "target": {"name": "parquet", "kind": "parquet"},
            "update_stats": True,
        },
        outputs=["feature_vector"],
    )
    # Feature selection
    feature_selection = mlrun.run_function(
        "hub://feature_selection",
        name="feature-selection",
        params={
            "output_vector_name": "short",
            "label_column": project.get_param("label_column", "label"),
            "k": 18,
            "min_votes": 2,
            "ignore_type_errors": True,
        },
        inputs={"df_artifact": get_vector.outputs["feature_vector"]},
        outputs=[
            "feature_scores",
            "selected_features_count",
            "top_features_vector",
            "selected_features",
        ],
    )
    # Train with hyperparameters
    train = mlrun.run_function(
        "hub://auto_trainer",
        name="train",
        handler="train",
        params={
            "sample": -1,
            "label_column": project.get_param("label_column", "label"),
            "test_size": 0.10,
        },
        hyperparams={
            "model_name": [
                "transaction_fraud_rf",
                "transaction_fraud_xgboost",
                "transaction_fraud_adaboost",
            ],
            "model_class": [
                "sklearn.ensemble.RandomForestClassifier",
                "sklearn.linear_model.LogisticRegression",
                "sklearn.ensemble.AdaBoostClassifier",
            ],
        },
        hyper_param_options=HyperParamOptions(
            strategy="list", selector="max.accuracy"
        ),
        inputs={"dataset": feature_selection.outputs["top_features_vector"]},
        outputs=["model", "test_set"],
    )
    # test and visualize your model
    test = mlrun.run_function(
        "hub://auto_trainer",
        name="evaluate",
        handler="evaluate",
        params={
            "label_columns": project.get_param("label_column", "label"),
            "model": train.outputs["model"],
            "drop_columns": project.get_param("label_column", "label"),
        },
        inputs={"dataset": train.outputs["test_set"]},
    )
    # Create a serverless function from the hub and add a feature enrichment router.
    # The router enriches and imputes each request with data from the feature vector.
    serving_function = mlrun.import_function("hub://v2_model_server", new_name="serving")
    serving_function.set_topology(
        "router",
        mlrun.serving.routers.EnrichmentModelRouter(
            feature_vector_uri="short", impute_policy={"*": "$mean"}
        ),
        exist_ok=True,
    )
    # Enable model monitoring
    serving_function.set_tracking()
    serving_function.save()
    # deploy the model server, pass a list of trained models to serve
    deploy = mlrun.deploy_function(
        serving_function,
        models=[{"key": "fraud", "model_path": train.outputs["model"]}],
    )
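
The train step uses `strategy="list"` with `selector="max.accuracy"`. As a rough mental model (not MLRun's code), the list strategy pairs the hyperparameter lists by position, one run per index, and the selector keeps the run with the highest reported `accuracy`. The helper names and the accuracy numbers below are hypothetical.

```python
def list_strategy(hyperparams):
    """Pair hyperparameter lists by position: one run per index."""
    names = list(hyperparams)
    if len({len(hyperparams[name]) for name in names}) != 1:
        raise ValueError("all hyperparameter lists must have the same length")
    return [dict(zip(names, combo)) for combo in zip(*(hyperparams[n] for n in names))]


def select_best(results, selector):
    """Apply a '<max|min>.<metric>' selector to a list of run results."""
    op, _, metric = selector.partition(".")
    chooser = max if op == "max" else min
    return chooser(results, key=lambda run: run[metric])


hyperparams = {
    "model_name": [
        "transaction_fraud_rf",
        "transaction_fraud_xgboost",
        "transaction_fraud_adaboost",
    ],
    "model_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.linear_model.LogisticRegression",
        "sklearn.ensemble.AdaBoostClassifier",
    ],
}

runs = list_strategy(hyperparams)

# Hypothetical accuracies, standing in for real training results
fake_accuracy = [0.97, 0.95, 0.96]
results = [dict(run, accuracy=acc) for run, acc in zip(runs, fake_accuracy)]

best = select_best(results, "max.accuracy")
print(best["model_name"])
```

Because values are paired by position, the second run in the pipeline above trains a `LogisticRegression` under the name `transaction_fraud_xgboost`.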

The workflow/pipeline can be executed using the MLRun SDK (the `project.run()` method) or using CLI commands (`mlrun project`), and can run directly from the source repo (Git). See details in the MLRun Projects and Automation documentation.

You can set arguments and destinations for the different artifacts when you run the workflow. The pipeline progress and results are shown in the notebook. Alternatively, you can check the progress, logs, artifacts, and more, in the MLRun UI or the CI/CD system. The next part demonstrates how to run the pipeline with custom arguments using the SDK.

In [16]:
# Register the workflow file as "main"
project.set_workflow('main', 'src/new_train_workflow.py')

## Running a pipeline

First run the following code to save your project:

In [17]:
project.save()

<mlrun.projects.project.MlrunProject at 0x7f573ab82f10>

Use the `run` MLRun project method to execute your workflow pipeline with Kubeflow Pipelines.

You can pass **`arguments`** or set the **`artifact_path`** to specify a unique path for storing the workflow artifacts.

In [18]:
run_id = project.run(
    'main',
    arguments={'vector_name':"transactions-fraud",
                'label_column':"labels.label",}, 
    dirty=True, watch=True)



ValueError: There are no functions in the project. Make sure you've set your functions with project.set_function().

![UI - WorkFlow](images/pipline-ui.png)

## Test the model endpoint


Now that your model is deployed using the pipeline, you can invoke it as usual:

In [33]:
# Define your serving function
serving_fn = project.get_function('serving')

# Choose an id for your test
sample_id = 'C1000148617'
model_inference_path = '/v2/models/fraud/infer'

# Send your sample ID for prediction
serving_fn.invoke(path=model_inference_path,
                  body={'inputs': [[sample_id]]})

> 2023-06-25 07:41:57,968 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-admin-serving.default-tenant.svc.cluster.local:8080/v2/models/fraud/infer'}


{'id': 'e9463e2e-5cff-4015-82e2-70594013b3f2',
 'model_name': 'fraud',
 'outputs': [0]}
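
The request above carries only an entity ID because the enrichment router looks up the remaining features before calling the model. The sketch below illustrates that idea with plain dictionaries: it fetches the feature row for an ID and fills missing values with the column mean, mimicking an `impute_policy` of `{"*": "$mean"}`. It is a simplified illustration, not the router's implementation; the table contents and the helper name are hypothetical, and the mean is computed on the fly here rather than taken from stored feature statistics.

```python
from statistics import mean


def enrich_and_impute(entity_id, online_table, feature_names):
    """Build a model input vector for entity_id, replacing missing values
    with the mean of the known values for that feature."""
    row = online_table.get(entity_id, {})
    vector = []
    for name in feature_names:
        value = row.get(name)
        if value is None:
            known = [r[name] for r in online_table.values() if r.get(name) is not None]
            value = mean(known) if known else None
        vector.append(value)
    return vector


# Hypothetical online feature table keyed by entity ID
online_table = {
    "C1": {"amount_avg_2h": 10.0, "amount_max_2h": 30.0},
    "C2": {"amount_avg_2h": 20.0, "amount_max_2h": None},
}
feature_names = ["amount_avg_2h", "amount_max_2h"]

vector = enrich_and_impute("C2", online_table, feature_names)
print(vector)
```

An ID that is missing from the table entirely gets every feature imputed, so the model still receives a complete vector.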

## Done!

You've completed Part 2 of the model training with the feature store.
Proceed to Part 5 to learn how to deploy and monitor the model.