## Lifecycle - 1

### Run a Training and Evaluation Job

In this step, we manually trigger a model training and evaluation run. A manual trigger may be needed after, for example, a model update or a change in production data.

In the previous workflow, we created an **Argo Workflow template** named `train-model.yaml`, which both trains and evaluates the model.

Let’s break down the key steps involved in this process:

1. **Template Parameters**:
   The template accepts three public IP addresses as parameters:

   * **train-ip**: The IP address of the Training Endpoint, which starts the model's training process and logs artifacts in **MLflow**.
   * **eval-ip**: The IP address of the Model Evaluation Endpoint, which evaluates the model and registers it in **MLflow** under a specific name.
   * **mlflow-ip**: The IP address where MLflow is reachable. The workflow uses it to query MLflow while training and evaluation are in progress.
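
   Because all three parameters are plain IP addresses, a typo only surfaces once a `curl` call fails inside the workflow. The following is a minimal sketch of a pre-submission check, not part of the original template; the helper name `is_ipv4` is hypothetical, and it only handles dotted-quad IPv4 addresses:

   ```bash
   # Hypothetical helper (not part of train-model.yaml): succeeds only for a
   # dotted-quad IPv4 address whose four octets are each in the range 0..255.
   is_ipv4() {
     local candidate=$1 old_ifs octet
     # Reject empty input, non-digit/non-dot characters, and misplaced dots.
     case "$candidate" in
       ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
     esac
     # Split on dots into positional parameters (safe: only digits and dots remain).
     old_ifs=$IFS
     IFS=.
     set -- $candidate
     IFS=$old_ifs
     [ "$#" -eq 4 ] || return 1
     for octet in "$@"; do
       case "$octet" in
         ????*) return 1 ;;   # more than three digits cannot be a valid octet
       esac
       [ "$octet" -le 255 ] || return 1
     done
     return 0
   }
   ```

   For example, `is_ipv4 "$TRAIN_IP" || echo "train-ip is not a valid IPv4 address" >&2` could run before the workflow is submitted, where `TRAIN_IP` is whatever variable holds the value you intend to pass.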

2. **Training the Model**:
   The training endpoint is triggered via an API call. The following code calls the endpoint, which starts training and logs the model artifacts, and exits if `curl` fails:

   ```bash
   RESPONSE=$(curl -f -s -X POST "http://{{inputs.parameters.train-ip}}:9090/train?model_name=resnet50&data_source=train")
   CURL_EXIT_CODE=$?
   echo "[INFO] Training endpoint response was: $RESPONSE" >&2
   if [ $CURL_EXIT_CODE -ne 0 ]; then
     echo "[ERROR] curl failed with code $CURL_EXIT_CODE" >&2
     exit $CURL_EXIT_CODE
   fi
   ```

   Training can take a long time, often longer than an HTTP request timeout allows. The endpoint therefore returns a **RUN\_ID** immediately instead of waiting for training to finish. The run is logged in **MLflow** under this ID, which we use to track its progress.

3. **Polling for Training Completion**:
   Since model training can take time, we poll the MLflow API every 10 seconds using the `RUN_ID` until the run reaches a terminal state (`FINISHED`, `FAILED`, or `KILLED`). The following code checks the training status:

   ```bash
   # '// empty' makes jq print nothing (instead of the string "null") when run_id is missing
   RUN_ID=$(echo "$RESPONSE" | jq -r '.run_id // empty')
   if [ -z "$RUN_ID" ]; then
     echo "[ERROR] run_id not found in response" >&2
     exit 1
   fi
   echo "[INFO] MLflow run ID: $RUN_ID" >&2

   # Poll until the run reaches a terminal state: FINISHED, FAILED, or KILLED
   while true; do
     STATUS=$(curl -s "http://{{inputs.parameters.mlflow-ip}}:8000/api/2.0/mlflow/runs/get?run_id=${RUN_ID}" | jq -r '.run.info.status')
     echo "[INFO] Run ${RUN_ID} status: ${STATUS}" >&2
     case "$STATUS" in
       FINISHED|FAILED|KILLED)
         echo "[INFO] Terminal state reached: $STATUS" >&2
         break
         ;;
     esac
     sleep 10
   done
   ```
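
   As written, the loop above polls indefinitely if the run never reaches a terminal state, and it exits the same way whether the run ends in `FINISHED`, `FAILED`, or `KILLED`. The following is a minimal sketch of one possible variant, not the template's actual code, that caps the number of polls and returns a non-zero status unless the run finished successfully. The names `wait_for_run`, `fetch_status`, and `MLFLOW_IP`, as well as the default cap of 360 polls, are assumptions made for illustration:

   ```bash
   # Hypothetical helper: poll a status command until a terminal state or a cap.
   # Usage: wait_for_run STATUS_CMD [MAX_ATTEMPTS] [INTERVAL_SECONDS]
   # Returns 0 on FINISHED, 1 on FAILED/KILLED, 2 if the cap is reached.
   wait_for_run() {
     local status_cmd=$1
     local max_attempts=${2:-360}   # assumed default: 360 polls x 10 s = 1 hour
     local interval=${3:-10}
     local attempt=1 status
     while [ "$attempt" -le "$max_attempts" ]; do
       status=$($status_cmd)
       echo "[INFO] Poll ${attempt}/${max_attempts}: status=${status}" >&2
       case "$status" in
         FINISHED) return 0 ;;
         FAILED|KILLED)
           echo "[ERROR] Run ended in state ${status}" >&2
           return 1 ;;
       esac
       # Sleep only when another poll will follow.
       if [ "$attempt" -lt "$max_attempts" ]; then
         sleep "$interval"
       fi
       attempt=$((attempt + 1))
     done
     echo "[ERROR] Gave up after ${max_attempts} polls" >&2
     return 2
   }

   # Assumed wiring: MLFLOW_IP and RUN_ID must be set by the caller; inside the
   # Argo template, the IP would come from {{inputs.parameters.mlflow-ip}}.
   fetch_status() {
     curl -s "http://${MLFLOW_IP}:8000/api/2.0/mlflow/runs/get?run_id=${RUN_ID}" \
       | jq -r '.run.info.status'
   }
   ```

   A step could then call `wait_for_run fetch_status || exit 1`, so that evaluation is skipped when training fails or never completes. Taking the status command as an argument keeps the polling logic independent of `curl` and `jq`, which also makes it easy to exercise with a stub.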

4. **Model Evaluation and Registration**:
   After training, the model must be evaluated. We call the **Model Evaluation Endpoint** to evaluate the model and register it in **MLflow**. The version of the registered model is then extracted from the response and written to standard output:

   ```bash
   EVAL_RESPONSE=$(curl -f -s -X GET "http://{{inputs.parameters.eval-ip}}:8080/get-version?run_id=${RUN_ID}")
   CURL_EXIT_CODE=$?
   echo "[INFO] Evaluation endpoint response was: $EVAL_RESPONSE" >&2
   if [ $CURL_EXIT_CODE -ne 0 ]; then
     echo "[ERROR] curl failed with code $CURL_EXIT_CODE" >&2
     exit $CURL_EXIT_CODE
   fi

   VERSION=$(echo "$EVAL_RESPONSE" | jq -r '.new_model_version // empty')
   if [ -z "$VERSION" ]; then
     echo "[ERROR] 'new_model_version' not found in response." >&2
     exit 1
   fi
   echo -n "$VERSION"
   ```

5. **Triggering the Workflow**:
   To trigger the workflow, navigate to **Argo Workflows** > **Workflow Templates** > **Submit**, enter the IP addresses for **train-ip**, **eval-ip**, and **mlflow-ip**, and click **Submit**.
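
   The same submission can probably also be made from a terminal. The following is a minimal sketch, assuming the Argo CLI is installed and that the template is registered under the name `train-model` (inferred from the file name `train-model.yaml`, not confirmed). The function name `submit_training` and the `ARGO_BIN` and `TEMPLATE_NAME` variables are hypothetical conveniences:

   ```bash
   # Hypothetical wrapper around `argo submit`.
   # Usage: submit_training TRAIN_IP EVAL_IP MLFLOW_IP
   submit_training() {
     if [ "$#" -ne 3 ]; then
       echo "usage: submit_training TRAIN_IP EVAL_IP MLFLOW_IP" >&2
       return 2
     fi
     # ARGO_BIN and TEMPLATE_NAME are assumed overrides; the defaults are a guess.
     "${ARGO_BIN:-argo}" submit \
       --from "workflowtemplate/${TEMPLATE_NAME:-train-model}" \
       -p "train-ip=$1" \
       -p "eval-ip=$2" \
       -p "mlflow-ip=$3" \
       --watch
   }
   ```

   Setting `ARGO_BIN=echo` before calling the function prints the arguments instead of submitting, which is a cheap way to check the parameters first. Depending on how Argo is installed, a namespace flag may also be required.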

6. **Container Build**:
   Once the model is registered, a **container build** is triggered automatically when the new model version is received from the training job. After the build completes, the latest model is accessible at the following endpoint:

   ```text
   http://A.B.C.D:8081
   ```

This flow ensures that the latest model is trained, evaluated, and registered. The **FastAPI** wrapper for this model will then replace the existing model (`bird.pth`) with the newly trained one, making it available to users.

Next, we’ll explore how this flow integrates into the system.
