Skip to content

Som/Null Backtest fold - best model fix and Foundational Model Retries#246

Merged
smdp2000ofc merged 8 commits into
mainfrom
som/chronos-best-model-fix
Apr 30, 2026
Merged

Som/Null Backtest fold - best model fix and Foundational Model Retries#246
smdp2000ofc merged 8 commits into
mainfrom
som/chronos-best-model-fix

Conversation

@smdp2000ofc
Copy link
Copy Markdown
Collaborator

@smdp2000ofc smdp2000ofc commented Apr 30, 2026

This pull request introduces several robustness and reliability improvements to the forecasting pipeline, focusing on handling partial model results, improving error handling and retries for API calls, and enhancing utility/test coverage. The changes ensure that only models with complete back test coverage are eligible for selection as the "Best Model," add exponential backoff for transient API failures, and improve utility function correctness and testing.

Model selection and back test coverage:

  • The final_models function now checks that only models with complete back test fold coverage are eligible for "Best Model" selection. Models missing any back test folds are flagged and excluded from selection, but their partial results remain in the output for visibility. If no models have complete coverage, an error is raised. [1] [2]
  • The logic for building the allow-list of eligible models for "Best Model" selection is updated to include only complete models and valid averages. The output contract is preserved so every model row has a non-NA Best_Model flag. [1] [2]

Chronos API reliability:

  • The send_chronos_request function now retries on transient failures (HTTP 429, 5xx, or connection errors) using exponential backoff with jitter, up to a maximum number of attempts. This makes the system more resilient to temporary API issues. [1] [2]

Utility and testing improvements:

  • The null_converter utility function is improved to handle more edge cases (NULL, zero-length, "NULL" string, NA) and now passes through non-NULL values and vectors of length > 1. Comprehensive tests are added to ensure correct behavior. [1] [2]
  • Additional variables are registered as global to avoid R CMD check warnings.

Documentation and namespace clean-up:

  • The S3 method registration for print.timesfm_model_fit_impl is removed from the namespace, and its documentation file is deleted, reflecting deprecation or removal of this method. [1] [2]This pull request introduces several robustness and reliability improvements to the forecasting pipeline, particularly around handling incomplete model evaluations and improving resilience to external API failures. The most significant changes are enhanced error handling and retry logic for Chronos API requests, stricter enforcement of model selection criteria, and improved utility/test coverage for helper functions.

Reliability and Robustness Improvements:

  • Added exponential backoff and retry logic (up to 3 retries) to the send_chronos_request function to handle transient Chronos API failures (HTTP 429, 5xx, and connection errors), improving reliability of external API calls. [1] [2]
  • Implemented stricter checks in final_models to ensure that only models with complete back test fold coverage are eligible for Best_Model selection, preventing partial models from being unfairly selected. Models with incomplete coverage are flagged and excluded from selection, but still included in outputs for transparency. [1] [2] [3]
  • Enhanced the Best_Model output logic so that every model in the output (including partial-coverage models) is flagged as either "Yes" or "No" for Best_Model, maintaining contract consistency.

Utility and Testing Enhancements:

  • Improved the null_converter utility function to robustly handle NULL, zero-length, "NULL" string, and NA inputs, with comprehensive new unit tests added for all edge cases. [1] [2]
  • Updated utils::globalVariables to include new variables used in back test coverage tracking, preventing unnecessary warnings.

Documentation and Namespace Clean-up:

  • Removed the S3 method registration and documentation for print.timesfm_model_fit_impl, reflecting its deprecation or removal. [1] [2]

These changes collectively harden the pipeline against transient failures, enforce fair model evaluation, and improve code maintainability and test coverage.This pull request introduces multiple improvements to model selection robustness, error handling, and testing in the forecasting pipeline. The main focus is to ensure that only models with complete back test coverage are eligible to win Best_Model, preventing models with missing folds (often due to transient API failures) from being unfairly selected. In addition, the Chronos API client now has robust retry logic, and comprehensive tests have been added to verify the new behavior.

Model selection robustness and fairness:

  • In final_models(), models with incomplete back test fold coverage are now identified and excluded from Best_Model selection. Partial models are flagged with warnings, and only models with predictions for all folds are eligible. If no model has complete coverage, an error is raised. This prevents models that failed on some folds from winning due to being evaluated on a smaller, easier sample. [1] [2] [3]
  • The output now ensures every model row (including partials) is present, but only complete models can be flagged as Best_Model, maintaining full visibility and backward compatibility.

Chronos API reliability:

  • The Chronos API client (send_chronos_request) now implements exponential backoff with jitter for transient failures (HTTP 429, 5xx, or connection errors), retrying up to a configurable maximum. This greatly improves robustness to intermittent API issues. [1] [2]

Testing and utility updates:

  • Extensive tests were added to verify that models with partial back test coverage cannot win Best_Model, that errors are raised if no model has full coverage, and that selection metrics are computed only from complete models.
  • Utility and documentation updates: Added fold_count to globalVariables, removed the unused print.timesfm_model_fit_impl alias and documentation, and improved the null_converter function for robustness. [1] [2] [3] [4]

These changes collectively ensure fairer model evaluation, more robust API interactions, and improved test coverage for critical selection logic.
This pull request introduces robust handling for transient API failures, improves the fairness and reliability of model selection, and strengthens utility/test coverage. The main changes are grouped below.

Resilient API Request Handling:

  • Added exponential backoff with jitter and retry logic to both send_chronos_request and send_timesfm_request, retrying up to three times on transient errors (HTTP 429, 5xx, or connection issues) to improve reliability when communicating with Chronos and TimesFM APIs. [1] [2] [3] [4]

Fair Model Selection and Evaluation:

  • In final_models, models that do not produce predictions for all expected back test folds are now detected and excluded from Best_Model selection, with detailed warnings and partial results retained for transparency. This prevents models with incomplete evaluations from being unfairly favored. [1] [2]
  • Ensured that only eligible models (those with complete back test coverage and valid averages) are considered for Best_Model selection, and that all models (including partials) are reported in outputs with appropriate flags. [1] [2]

Utility and Test Improvements:

  • Improved the null_converter utility to robustly handle a wider range of null-like inputs, and added comprehensive unit tests to verify its behavior. [1] [2]
  • Registered new global variables to avoid R CMD check warnings for new columns used in model coverage tracking.

Documentation and Cleanup:

  • Removed the print method for timesfm_model_fit_impl from the NAMESPACE and deleted its documentation, as it is no longer needed. [1] [2]

These changes collectively make the modeling pipeline more robust, transparent, and maintainable.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds safeguards to prevent “Best_Model” selection from being biased by models that silently drop some backtest folds, and improves robustness around Chronos API calls and NULL handling.

Changes:

  • Exclude partial-fold models from Best_Model eligibility and emit warnings / error when no complete models exist.
  • Add retry + exponential backoff for transient Chronos API failures.
  • Add testthat coverage reproducing the partial-fold Best_Model bug and asserting the corrected behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/testthat/test-best_models.R New integration-style tests that corrupt fold coverage to ensure Best_Model selection ignores partial models and errors when none are complete.
R/final_models.R Detect incomplete fold coverage, warn, exclude partial models from Best_Model metric computation, and ensure Best_Model flags remain non-NA.
R/chronos_controller.R Add retry loop with exponential backoff/jitter for transient Chronos API request failures.
R/agent_iterate_forecast.R Extend null_converter() to treat NULL, empty, NA, and "NULL" consistently.
R/utility.R Register fold_count as a global variable to satisfy R CMD check.
NAMESPACE Remove S3 print registration for timesfm_model_fit_impl.
man/print.timesfm_model_fit_impl.Rd Remove generated Rd for the removed/unsupported print method.
Comments suppressed due to low confidence (1)

R/final_models.R:1

  • The implementation and messaging say “back test fold coverage”, but expected_fold_count <- length(train_test_id_list) appears to include both Back_Test and Future_Forecast IDs (per the earlier construction of train_test_id_list). That makes the ‘back test folds’ count (and missing fold IDs) potentially include the future forecast ID as well, and can exclude models for missing future output via the ‘back test coverage’ path. Consider computing coverage using only backtest IDs (e.g., Train_Test_ID != 1) and handling missing-future separately (or rename the concept/message to ‘train/test IDs’ if the intent is to require both).
#' Final Models

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/agent_iterate_forecast.R
Comment thread R/chronos_controller.R
Comment thread R/chronos_controller.R
Comment thread R/chronos_controller.R
Comment thread R/chronos_controller.R
Comment thread R/final_models.R Outdated
Comment thread R/final_models.R Outdated
Comment thread R/final_models.R Outdated
Comment thread R/chronos_controller.R
@smdp2000ofc
Copy link
Copy Markdown
Collaborator Author

@copilot can you raise another PR for the suggestion you gave and also, look into the comments that i have for some suggestion

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@smdp2000ofc I've opened a new pull request, #247, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 3 commits April 30, 2026 17:38
…, backtest-only fold coverage, retry tests

Agent-Logs-Url: https://github.com/microsoft/finnts/sessions/fbb0b01c-29a0-4759-a221-760420544afa

Co-authored-by: smdp2000ofc <236835326+smdp2000ofc@users.noreply.github.com>
…lumns

Agent-Logs-Url: https://github.com/microsoft/finnts/sessions/fbb0b01c-29a0-4759-a221-760420544afa

Co-authored-by: smdp2000ofc <236835326+smdp2000ofc@users.noreply.github.com>
Fix review comments: backtest-only fold coverage, null_converter safety, CHRONOS_MAX_RETRIES clarity, retry tests
@smdp2000ofc smdp2000ofc changed the title Null Backtest fold, best model fix Null Backtest fold, best model fix and Foundational Model Retries Apr 30, 2026
@smdp2000ofc smdp2000ofc changed the title Null Backtest fold, best model fix and Foundational Model Retries Null Backtest fold - best model fix and Foundational Model Retries Apr 30, 2026
@smdp2000ofc
Copy link
Copy Markdown
Collaborator Author

@copilot can u review just last commit -
4153a12

… non-retryable paths

Agent-Logs-Url: https://github.com/microsoft/finnts/sessions/e858482b-cb78-4128-8cc3-959ce788691c

Co-authored-by: smdp2000ofc <236835326+smdp2000ofc@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 30, 2026

@copilot can u review just last commit -
4153a12

Here's the review of commit 4153a12:

What the commit does:

  • Renames backoff_waitchronos_backoff_wait in chronos_controller.R (clean, avoids potential name collisions)
  • Adds retry logic to send_timesfm_request in timesfm_model.R, mirroring the existing Chronos retry pattern (TIMESFM_MAX_RETRIES = 3, total 4 attempts)
  • Updates Chronos tests and adds matching tests for TimesFM retry scenarios

Issue found and fixed (bbe9cff):
Both send_timesfm_request (new) and send_chronos_request (pre-existing) were missing return() on the non-retryable 4xx path and the max-retries-exceeded path:

# before - missing return()
if (status >= 400 && status < 500 && status != 429L) {
  parse_timesfm_response(response)
}
if (attempt > max_retries) {
  parse_timesfm_response(response)
}

These relied on parse_*_response always calling stop() for non-2xx, which worked but was fragile and inconsistent. Fixed with explicit return() in both functions.

Everything else looks good: retry counts are correct (1 original + 3 retries = 4 total), the backoff logic and jitter cap are consistent between Chronos and TimesFM, and the tests cover all key paths (connection errors, 429, 5xx, non-retryable 4xx, total attempt count).

@smdp2000ofc smdp2000ofc changed the title Null Backtest fold - best model fix and Foundational Model Retries Som/Null Backtest fold - best model fix and Foundational Model Retries Apr 30, 2026
@smdp2000ofc smdp2000ofc requested a review from mitokic April 30, 2026 20:05
Comment thread R/final_models.R
back_test_id_list <- setdiff(train_test_id_list, 1)
expected_back_test_fold_count <- length(back_test_id_list)

if (expected_back_test_fold_count > 0) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why there wouldn't be back test folds?

Comment thread R/final_models.R
}
}

complete_model_ids <- if (expected_back_test_fold_count > 0) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again why check for no back test folds. In theory there should always be a back test

@mitokic mitokic self-requested a review April 30, 2026 22:21
@smdp2000ofc smdp2000ofc merged commit f70086a into main Apr 30, 2026
3 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants