Som/Null Backtest fold - best model fix and Foundational Model Retries #246
Conversation
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds safeguards to prevent “Best_Model” selection from being biased by models that silently drop some backtest folds, and improves robustness around Chronos API calls and NULL handling.
Changes:
- Exclude partial-fold models from Best_Model eligibility and emit warnings / error when no complete models exist.
- Add retry + exponential backoff for transient Chronos API failures.
- Add testthat coverage reproducing the partial-fold Best_Model bug and asserting the corrected behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Summary per file:
| File | Description |
|---|---|
| tests/testthat/test-best_models.R | New integration-style tests that corrupt fold coverage to ensure Best_Model selection ignores partial models and errors when none are complete. |
| R/final_models.R | Detect incomplete fold coverage, warn, exclude partial models from Best_Model metric computation, and ensure Best_Model flags remain non-NA. |
| R/chronos_controller.R | Add retry loop with exponential backoff/jitter for transient Chronos API request failures. |
| R/agent_iterate_forecast.R | Extend null_converter() to treat NULL, empty, NA, and "NULL" consistently. |
| R/utility.R | Register fold_count as a global variable to satisfy R CMD check. |
| NAMESPACE | Remove S3 print registration for timesfm_model_fit_impl. |
| man/print.timesfm_model_fit_impl.Rd | Remove generated Rd for the removed/unsupported print method. |
Comments suppressed due to low confidence (1)
R/final_models.R:1
- The implementation and messaging say "back test fold coverage", but `expected_fold_count <- length(train_test_id_list)` appears to include both Back_Test and Future_Forecast IDs (per the earlier construction of `train_test_id_list`). That makes the "back test folds" count (and the missing fold IDs) potentially include the future forecast ID as well, and can exclude models for missing future output via the "back test coverage" path. Consider computing coverage using only backtest IDs (e.g., `Train_Test_ID != 1`) and handling missing-future separately (or rename the concept/message to "train/test IDs" if the intent is to require both).
#' Final Models
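The suggested backtest-only coverage computation can be sketched as follows. `train_test_id_list` and the convention that `Train_Test_ID == 1` is the future forecast come from the comment above; `model_preds` (a long data frame of per-model, per-fold predictions) and the coverage logic are hypothetical illustrations, not the package's actual code:

```r
# Back test IDs only: drop the future forecast ID (assumed to be 1).
back_test_id_list <- setdiff(train_test_id_list, 1)
expected_back_test_fold_count <- length(back_test_id_list)

# model_preds is a hypothetical data frame with Model_ID and Train_Test_ID
# columns, one row per prediction produced by a model on a fold.
back_test_preds <- model_preds[model_preds$Train_Test_ID %in% back_test_id_list, ]

# Count the distinct back test folds covered by each model.
fold_coverage <- tapply(
  back_test_preds$Train_Test_ID,
  back_test_preds$Model_ID,
  function(ids) length(unique(ids))
)

# Only models covering every back test fold stay eligible for Best_Model.
complete_model_ids <- names(fold_coverage)[fold_coverage == expected_back_test_fold_count]
```

Missing future-forecast output would then be detected separately, rather than folded into the back test coverage check.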
@copilot can you raise another PR for the suggestion you gave, and also look into the comments that I have for some suggestions
@smdp2000ofc I've opened a new pull request, #247, to work on those changes. Once the pull request is ready, I'll request review from you.
…, backtest-only fold coverage, retry tests Agent-Logs-Url: https://github.com/microsoft/finnts/sessions/fbb0b01c-29a0-4759-a221-760420544afa Co-authored-by: smdp2000ofc <236835326+smdp2000ofc@users.noreply.github.com>
…lumns Agent-Logs-Url: https://github.com/microsoft/finnts/sessions/fbb0b01c-29a0-4759-a221-760420544afa Co-authored-by: smdp2000ofc <236835326+smdp2000ofc@users.noreply.github.com>
Fix review comments: backtest-only fold coverage, null_converter safety, CHRONOS_MAX_RETRIES clarity, retry tests
… non-retryable paths Agent-Logs-Url: https://github.com/microsoft/finnts/sessions/e858482b-cb78-4128-8cc3-959ce788691c Co-authored-by: smdp2000ofc <236835326+smdp2000ofc@users.noreply.github.com>
Here's the review of the commit.

Issue found and fixed: two early-exit paths were missing explicit `return()` calls:

```r
# before - missing return()
if (status >= 400 && status < 500 && status != 429L) {
  parse_timesfm_response(response)
}

if (attempt > max_retries) {
  parse_timesfm_response(response)
}
```

These relied on the `if` body happening to be the last expression evaluated; without `return()`, execution could fall through into the retry logic. Everything else looks good: retry counts are correct (1 original + 3 retries = 4 total), the backoff logic and jitter cap are consistent between Chronos and TimesFM, and the tests cover all key paths (connection errors, 429, 5xx, non-retryable 4xx, total attempt count).
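For context, the corrected control flow the review describes can be sketched like this. The helper names (`perform_request`, `parse_response`) and parameter defaults are illustrative assumptions, not the package's actual API; the point is the explicit `return()` on the non-retryable 4xx path and after the retry budget is exhausted, with exponential backoff and capped jitter in between:

```r
send_request_with_retries <- function(req, max_retries = 3, base_delay = 1, jitter_cap = 0.5) {
  attempt <- 1
  repeat {
    # perform_request() is a hypothetical stand-in for the actual HTTP call.
    response <- tryCatch(perform_request(req), error = function(e) e)
    connection_error <- inherits(response, "error")
    status <- if (connection_error) NA_integer_ else response$status

    if (!connection_error) {
      # Non-retryable client errors (4xx other than 429): return immediately.
      if (status >= 400 && status < 500 && status != 429L) {
        return(parse_response(response))
      }
      # Success: return the parsed result.
      if (status < 400) {
        return(parse_response(response))
      }
    }

    # Retry budget exhausted: re-raise a connection error, or return
    # whatever the last response parsed to.
    if (attempt > max_retries) {
      if (connection_error) stop(response)
      return(parse_response(response))
    }

    # Exponential backoff with capped jitter before the next attempt.
    delay <- base_delay * 2^(attempt - 1) + runif(1, 0, jitter_cap)
    Sys.sleep(delay)
    attempt <- attempt + 1
  }
}
```

With `max_retries = 3` this makes at most 4 total attempts (1 original + 3 retries), matching the counts checked in the tests.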
```r
back_test_id_list <- setdiff(train_test_id_list, 1)
expected_back_test_fold_count <- length(back_test_id_list)

if (expected_back_test_fold_count > 0) {
```

Is there any reason why there wouldn't be back test folds?

```r
complete_model_ids <- if (expected_back_test_fold_count > 0) {
```

Again, why check for no back test folds? In theory there should always be a back test.
This pull request introduces several robustness and reliability improvements to the forecasting pipeline, focusing on handling partial model results, retrying transient API failures, and strengthening utility and test coverage.

Model selection and back test coverage:
- The `final_models()` function now checks that only models with complete back test fold coverage are eligible for Best_Model selection. Models missing any back test folds are flagged with warnings and excluded from selection, but their partial results remain in the output for visibility. If no model has complete coverage, an error is raised. This prevents models that failed on some folds (often due to transient API failures) from winning by being evaluated on a smaller, easier sample.
- Best_Model flags are guaranteed to remain non-NA.

API reliability:
- `send_chronos_request()` and `send_timesfm_request()` now retry on transient failures (HTTP 429, 5xx, or connection errors) using exponential backoff with jitter, up to three retries. This makes the system more resilient to temporary API issues.

Utility and testing improvements:
- The `null_converter()` utility now handles more edge cases (`NULL`, zero-length input, the string `"NULL"`, and `NA`) and passes through non-NULL values and vectors of length > 1, with comprehensive unit tests covering all edge cases.
- `fold_count` and the other variables used in back test coverage tracking are registered via `utils::globalVariables()` to satisfy R CMD check.

Documentation and namespace clean-up:
- The S3 print method `print.timesfm_model_fit_impl` is removed from the NAMESPACE and its generated documentation file is deleted, reflecting its removal.

These changes collectively harden the pipeline against transient failures, enforce fair model evaluation, and improve maintainability and test coverage.
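For illustration, the `null_converter()` behavior described above can be sketched as follows; the actual implementation in `R/agent_iterate_forecast.R` may differ in detail:

```r
null_converter <- function(x) {
  # NULL or zero-length input normalizes to NULL.
  if (is.null(x) || length(x) == 0) {
    return(NULL)
  }
  # Vectors of length > 1 pass through untouched.
  if (length(x) > 1) {
    return(x)
  }
  # A single NA or the literal string "NULL" also normalizes to NULL.
  if (is.na(x) || identical(x, "NULL")) {
    return(NULL)
  }
  x
}
```

testthat expectations along the lines of `expect_null(null_converter("NULL"))`, `expect_null(null_converter(NA))`, and `expect_identical(null_converter(c(1, 2)), c(1, 2))` would exercise these edge cases.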