write initial valid metrics from t2 clients to tensorboard #3

holgerroth · 2022-11-08T15:59:53Z

Adds TensorBoard writing to the IntimeModelSelector on the hub server.
It writes the initial validation metric (computed at the beginning of each round) to a file in the job workspace of the hub server.

Note: the initial metrics and T2 client names are sent as part of the shareable and collected in the T2 servers (in the aggregation class of the connectors). Let me know if you have any concerns.
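
For reviewers, here is a rough sketch of the data flow described above. It is not the code in this PR. The dictionary keys `t2_client_names` and `initial_metrics`, the tag prefix `initial_val_metric/`, and the `RecordingWriter` class are all hypothetical names chosen for illustration. The stand-in writer only mirrors the `add_scalar(tag, scalar_value, global_step)` call shape of a TensorBoard `SummaryWriter`, so the sketch runs without TensorBoard installed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RecordingWriter:
    """Hypothetical stand-in for a TensorBoard writer; it only records add_scalar calls."""

    scalars: List[Tuple[str, float, int]] = field(default_factory=list)

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None:
        self.scalars.append((tag, scalar_value, global_step))


def collect_initial_metrics(shareables: List[dict]) -> Dict[str, float]:
    """Merge the per-T2-client initial metrics carried by each shareable.

    The dict layout is an assumption: each shareable is modeled as a plain dict
    holding parallel lists of client names and metric values.
    """
    merged: Dict[str, float] = {}
    for shareable in shareables:
        names = shareable["t2_client_names"]
        metrics = shareable["initial_metrics"]
        for name, metric in zip(names, metrics):
            merged[name] = metric
    return merged


def write_initial_metrics(writer, metrics: Dict[str, float], current_round: int) -> None:
    """Write one scalar per T2 client, using the round number as the step."""
    for name in sorted(metrics):
        writer.add_scalar(f"initial_val_metric/{name}", metrics[name], current_round)


# Example: two T2 servers report metrics for three T2 clients in round 0.
example_shareables = [
    {"t2_client_names": ["site-a1", "site-a2"], "initial_metrics": [0.61, 0.58]},
    {"t2_client_names": ["site-b1"], "initial_metrics": [0.70]},
]
example_writer = RecordingWriter()
write_initial_metrics(example_writer, collect_initial_metrics(example_shareables), current_round=0)
```

In a real setup the recording writer would be replaced by an actual TensorBoard writer pointed at the hub server's job workspace. One tag per T2 client keeps each site's curve separate in the TensorBoard UI.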

* change README to remove Quick start, reduce POC and other in quick start move feature highlights in release node README redesign * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * Change to use new FLARE API * Change to use new FLARE API * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * update * update * update * Add notes for traditional ML and FedSM (#2) * Update README.md * Update README.md * update readme (#3) * update readme * update * more updates * Update README.md * Update README.md * Update release_notes.md * Update release_notes.md * minor text edit * minor text edit --------- Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>

* Multi-process worker integration with FCI cellnet (NVIDIA#1393) * FL server integrate with FCI cellnet. * Codestyle reformat. * fed_server_test.py integrate with Cellnet. * FL client integrate with Cellnet. Client register to server. * Reformat codestyle. * reformat codestyle. * update the message. * Codestyle reformat. * Fixed the import sort. * Addes the PR reviews. * codestyle fix. * made the cell_timeout configurable. * codestyle fix. * removed no use import. * disable simulator_runner_test temporary. * FCI integration for job run. * Fix for the admin auto login. * reformat codestyle. * moved create_admin_server() to utils. * removed the no use import. * sort import. * Changes after the PR reviews. * rolled back the change dh_psi_test.py. * Removed no use import. * PR review changes. * type hint change. * FCI integration multi-gpu changes. * PR review changes. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Quick start [skip ci] (NVIDIA#1385) * update quick start * rewrite quick start * quick start guide * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update README.md * update * update * update * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * 1. Move mission after NVFLARE section 2. set the workspace to be job specific 3. make production model section collapsible * Update based PR comments * update * Move HE from app_common to app_opt. Update app_opt requirements (NVIDIA#1392) * Move HE from app_common to app_opt. Update app_opt requirements management. 
* Address comments * Fix github premerge * Change all to dev for test env. * Use scikit-learn instead of sklearn * Fix circular import * Fix typo * Use dev for test env. * Fix in time model selector (NVIDIA#1401) * Use get cookie instead of get header to get CONTRIBUTION_ROUND * Fix intime model selector issue * Fix HE imports * Fix unit test * Simulator integration with FCI Cellnet (NVIDIA#1398) * simulator integrate with FCI. * codestyle reformat. * Removed the no use import. * Removed no use import. * PR reviews change. * rolled back a change. * Refactored. * Removed no use import. * sort import. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add limit to number of jobs in list_jobs and options to flare_api (NVIDIA#1381) * Add limit to number of jobs in list_jobs and options to flare_api * remove print * Remove print Remove print statement that should not be there --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed close_cb bug and added socket cleanup (NVIDIA#1399) * Merged async TCP driver to dev (NVIDIA#1397) * fix new_insecure_session (NVIDIA#1403) * Update SKLearn readmes and refactor SKLearnExecutor [skip ci] (NVIDIA#1388) * update readmes and refactor SKLearnExecutor add SVC link update return type hints and readme * update type hint * Merged async UDS driver to dev (NVIDIA#1404) * add auc log (NVIDIA#1406) add Homogeneity log * update README for hello-pt on model initialization [skip ci] (NVIDIA#1402) * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README for hello-pt on model initialization * update README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Graceful cell stop (NVIDIA#1405) * help graceful cell closing and shutdown * reformat * no need to join daemon thread --------- Co-authored-by: Chester Chen 
<512707+chesterxgchen@users.noreply.github.com> * Ha fix (NVIDIA#1407) * Fixes for HA. * codestyle fix. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * update README.md (NVIDIA#1408) Co-authored-by: chesterc <n9Z0GoPp5u1Y> * README Update for PSI [skip ci] (NVIDIA#1409) * update README.md * update README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * update README 3 [skip ci] (NVIDIA#1410) * update README.md * update README.md * update README.md * update README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Add note for brats18 data access (NVIDIA#1245) * Add a note for brats18 dataset and fix a bug in prostate example * reorganize folder * reorganize folder * update brats link * Readme 4 [skip ci] (NVIDIA#1413) * fix some sentence * fix some sentence * formatting changes * formatting changes * update PSI image and README.md * update PSI image and README.md * update PSI image and README.md * update PSI image and README.md --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Fix integration tests (NVIDIA#1370) * Fix integration tests * Fix dummy yaml * fix yaml * clean up workspace * use secure mode * Increase buffer size * Try not start server * raise exception if things go wrong * Read more lines * Debugging * Use subprocess * Use subprocess as default rather than pty * To be consistent with CI env * Fix admin console test * Update run_integration_tests.sh --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * 1. Added remove_endpoint() (NVIDIA#1414) 2. Unified to max message size to 2GB 3. Fixed the deleting socket file problem. 
* RESTORE Old README before Release [skip ci] (NVIDIA#1418) * update README.md * update README.md * update README.md * update README.md * RESTORE OLD README before release --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * update fl context to sync correctly; make current round sticky in SaG workflow (NVIDIA#1400) update unit test * Randomize azure client resource group (NVIDIA#1419) * Enahce Simulator to avoid the Cell Error at end run. (NVIDIA#1421) * Hide cell cmds (NVIDIA#1420) * hide cell commands * changed for_test to diagnose --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Changed the fetch_task fetch_again without delay. (NVIDIA#1423) * fix default order of jobs in list_jobs command (NVIDIA#1416) * fix default order of jobs in list_jobs command * revise behavior of list_jobs * fix ci * Add back the SimulatorRunner (NVIDIA#1425) * Add required stuff back * Fix year * Move virtual env of all examples to main folder (NVIDIA#1411) * Move virtual env of all examples to main folder * Reverse change to cifar * Remove venv prefix --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Move CIFAR10 example and update CI tests (NVIDIA#1415) * Add debug mode to ci (NVIDIA#1428) * Add debug mode to ci * Undo other changes * Restructure hello-world examples to standardize for tests (NVIDIA#1412) * restructure hello-world examples to standardize for tests * update tests with new locations of jobs * rename job_configs directory to jobs * add missed rename * Move TBReceiver to experiment tracking (NVIDIA#1424) * Move TBReceiver to experiment tracking * Move job_configs to jobs * Change tensorflow to tensorboard * Use setup steps in CI * Update setup.cfg * Add __init__.py to decomposers folder so build system will include it. 
(NVIDIA#1430) * Add messages at the end of cloud launch scripts so (NVIDIA#1432) users know how to delete the resource group / terminate the EC2 instance * UPDATE PSI README.md (NVIDIA#1434) * Avoid the simulator cell error after END_RUN. (NVIDIA#1431) * Cleaned up logs (NVIDIA#1426) * Enable Simulator to use resources.json. (NVIDIA#1435) * Enable Simulator to use resources.json. * update log. * Fix list jobs command argument parsing bug (NVIDIA#1427) * Fix list job bug * Keep the default behavior the same as 2.2 * Fix CI issue --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed the simulator hang due to missing import. (NVIDIA#1436) * Fixed the simulator hang due to missing import. * Added log for the error. * Removed commented out code. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Switch to --use-device-code for all az login cases (NVIDIA#1437) * update nightly build version (NVIDIA#1439) * Enhance the job run process not to kill its own process, instead let … (NVIDIA#1440) * Enhance the job run process not to kill its own process, instead let it to MPM manage. * refactored. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Remove unused codes (NVIDIA#1442) * Fixed a few QA bugs. 
(NVIDIA#1445) * Random forest update (NVIDIA#1441) * Fix SnG workflow allowing empty global model for random forest and xgboost * Fix SnG workflow allowing empty global model for random forest and xgboost * Reverse error in auto refectoring * Move the allow empty check * Update readme * Update util functions and folder names * Update util functions and folder names * Add model validation script and results * change server json * Improve POC shutdown (NVIDIA#1438) Change to use new FLARE API remove print statement use insecure_session fix formatting issue Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Randomize resource group to avoid duplicate resource group names (NVIDIA#1450) * Added more detail when recursive data is found in FOBS (NVIDIA#1448) * Fixed the QA test recursive ref issue. (NVIDIA#1451) * Fixed the issue job status not updated to exception when controller e… (NVIDIA#1447) * Fixed the issue job status not updated to exception when controller exception. * Added a job_id in runner_process check. * removed comment out codes. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update integration tests; Add test config auto generation code (NVIDIA#1446) * Update integration tests; Add test config auto generation code * Remove files that should not be checked in * add more options for ci script * Fix handling admin_api response * Update tb streaming test * Shorten ci premerge * Remove unused dependecies * Change test_diff_job_config from POC to HA for clean shutdown --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * README redesign [skip ci] (NVIDIA#1449) * change README to remove Quick start, reduce POC and other in quick start move feature highlights in release node README redesign * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * UPDATE README.md * Change to use new FLARE API * Change to use new FLARE API * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * update * update * update * Add notes for traditional ML and FedSM (#2) * Update README.md * Update README.md * update readme (#3) * update readme * update * more updates * Update README.md * Update README.md * Update release_notes.md * Update release_notes.md * minor text edit * minor text edit --------- Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> * Check if resource group exists. If yes, reuse it. 
(NVIDIA#1456) * Move split learning to advanced examples; update release notes (NVIDIA#1457) * Fix admin API issues and support optional messages (NVIDIA#1458) * fix qa issues * reformat * restor executable scripts (NVIDIA#1460) * Fix jupyter notebook FLARE API path issue (NVIDIA#1462) Add codes to set username in jupyter notebook at provisioning time * Silent Reconnect (NVIDIA#1463) * Added more detail when recursive data is found in FOBS * Added silent reconnect * fix shutdown log messages (NVIDIA#1465) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Add FedSM example (NVIDIA#562) * Initial deposit for unorganized FedSM example * update readme for FedSM * update config for FedSM * update config for FedSM * update config for FedSM * move fedsm from example to research * change fedsm to use simulator * change fedsm to use simulator * Format compliance * Format compliance * Fully functional FedSM * 3-client version for stable simulation without error * Update tb record plot and testing scripts * Update tb record plot and testing scripts * Code update * fix typos; add citaton * Update readme correct num_clients and datapath * Code update to reflect the latest reviews * Update to reflect suggestions * Update global best model saving and testing scripts * Update global best model saving and testing scripts * Update readme and remove single-line scripts * Update to reflect comments * code refactor and corrections * code refactor for new dev branch * change jobs folder name * latest communication pattern * update the learnable pattern * Remove after train validation * Update to reflect the results under latest dev branch * Add testing results and update curve * Update config, plot, requirement, and readme * Update readme --------- Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Randomize security group in AWS client scripts (NVIDIA#1467) * Fix cifar and auto 
generated integration tests (NVIDIA#1455) * Update Federated Stats to follow the new example structure [skip ci] (NVIDIA#1464) * 1. restructure and example to the standard format : add prepare_data.sh 2. update README.md (due to example structure changes) * 1. restructure and example to the standard format : add prepare_data.sh 2. update README.md (due to example structure changes) * cleanup * cleanup * update Image_stats job as well * restore the original version * remove invalid tests * restructure research folders (NVIDIA#1469) follow template requirements section fix typo restor xgboost example reword * fixed peer context handling in aux runner (NVIDIA#1470) * fixed peer context handling in aux runner * remove unused import * convert PSI to the standard test structure (NVIDIA#1468) * Update docs to have release notes in whats new, new glossary, fixes [skip ci] (NVIDIA#1461) * Update docs to have release notes in whats new, new glossary, fixes * Fix issue with jquery not being available in built docs * address PR comments, link to previous versions of examples, further additions --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Server Listens on All Interfaces (NVIDIA#1471) * fix configuration for readthedocs to build docs with new requirements (NVIDIA#1472) * Fix fl context prop (NVIDIA#1474) * Fix fl context prop * Change to sticky * Fixed exception in list_jobs (NVIDIA#1473) * Added more detail when recursive data is found in FOBS * Fixed exception in list_jobs when no jobs --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed the simulator threads option for multi_gpu case. 
(NVIDIA#1476) * Change Azure VM create to remove warning (NVIDIA#1477) Show Azure VM login info * cleanup error msg; fix sag wait; fix get_task timeout (NVIDIA#1479) * cleanup error msg; fix sag wait; fix get_task timeout * update test case * Fix job runner multiple start issue (NVIDIA#1466) * Start job runner when server is turn to hot * undo changes * Address comments * Address comments * Address comments * Get rid of hello-examples warnings (NVIDIA#1475) * Get rid of hello-examples warnings * Fix import --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update protobuf version (NVIDIA#1478) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Configuration exception handling (NVIDIA#1480) * WIP * fix exception swallow bugs * fix exception swallow bugs * 1) change definition is_class to have path argument with class_path, no need for "args" 2) add more unit tests for no argument case 3) fix test case failure for python3.10 where the failure message changes from version less than 3.9 4) restore example config formatting. * fix api status and dead job message (NVIDIA#1484) * fix list_job in flare api (NVIDIA#1487) * protect server state against multiple state changes (NVIDIA#1489) * Fix loading conf in aws scripts (NVIDIA#1488) Add early stop on error cases in aws * add wait_for_system_shutdown [skip ci] (NVIDIA#1481) * add wait_for_system_shutdown * add wait_for_system_shutdown * add wait_for_system_shutdown * handle code with N answer * Add Jupyter-Lab notebooks [skip ci] (NVIDIA#1482) * Add Jupyter-Lab notebooks 1) getting_started.ipynb 2) install_in_container.ipynb 3) data_frame_fed_stats.ipynb 4) readme update for df_stats * add POC notebook and POC run * add new Notebooks * clean up * 1. clean up 2. remove install_in_container.ipynb * 1. clean up 2. remove install_in_container.ipynb 3. remove exmaples notebook * 1. clean up 2. remove install_in_container.ipynb 3. 
remove exmaples notebook * update * Fix controller timing issue (NVIDIA#1459) * Fix many build warning and issues, more documentation additions [skip ci] (NVIDIA#1486) * fix many build warning and issues, more documentation additions * fix ci --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Splitnn fix (NVIDIA#1485) * fix paths in split learning example add new line in configs * fix circular import * new line --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * minor fixes (NVIDIA#1495) * fix job listing (NVIDIA#1496) * fix job listing * updated test cases * improve authz user print format * Add user guide on cloud deployment (NVIDIA#1497) * add back sections for migrating that were removed (NVIDIA#1498) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * simulator create the clients in parallel. (NVIDIA#1491) * simulator create the clients in parallel. * Changed to use threadpool to create the clients in parallel. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add notebooks for traditional ml examples (NVIDIA#1483) * add notebook for kmeans example * update kmeans notebook * update kmeans notebook * update kmeans notebook * update kmeans notebook * update kmeans notebook * update kmeans notebook --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * use common JupyterLab instructions (NVIDIA#1499) update link update readme restore getting started notebook delete some output * fix wf task exit status handling (NVIDIA#1494) * fix wf task exit status handling * fix dead client detection * update what's new (NVIDIA#1502) * MONAI example updates (NVIDIA#1506) * update instructions & paths * remove virtualenv folders * fix links * Add check on az login exit code (NVIDIA#1504) Add check on derived location and specified location * silent abort message logging (NVIDIA#1505) * fix listjobs detail handling (NVIDIA#1503) * not creating internal listener for the job cell. (NVIDIA#1507) * not creating internal listener for the job cell. * create the client internal listener for multi-gpu case. * Update README, Notebook, Fed Stats fix (NVIDIA#1501) * 1. notebook and fed status and README.md * update * update * fix typo * rm unnecessary virtualenv folder (NVIDIA#1512) * Ensure the start_run event for sub_worker_process. 
(NVIDIA#1514) * Remove things in __init__.py in app_opt (NVIDIA#1508) * Add notebooks for other machine learning methods (NVIDIA#1500) * add notebook for random forest * add notebook for random forest * add notebook for random forest * update readme for random forest * add linear model * add linear model * add linear model * add svm model * add xgboost tree model * add xgboost tree model * add notebook for xgboost tree * add notebook for xgboost histogram * correction to xgboost sharable generator and executor * correction to xgboost sharable generator and executor * correction to xgboost sharable generator and executor * rename job_configs to jobs * remove notebook outputs --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * More docs additions and fixes [skip ci] (NVIDIA#1510) * add notebook for simulator and other docs additions and fixes * remove notebook for simulator and add logging configuration page and info for migrating to 2.3 * Hello World Notebook (New) [skip ci] (NVIDIA#1518) * Added more detail when recursive data is found in FOBS * Added hello-world notebook * Removed N >= 2 * Fix auth test (NVIDIA#1519) * CIFAR-10 Auto-FedRL example (NVIDIA#1283) * Try to fix unsigned commits * refactor ScatterAndGatherAutoFedRL using python inheritance update path to accommodate latest nvflare change Note to TODO add license and update README * remove virtualenv folder * add reproduced results on cifar-10 clean code clean code * remove decomposers in PR add more exp details to README * pt_decomposers -> decomposers * add more util details remove nvflare from req file job_configs -> jobs * correct typo and add nvflare req --------- Co-authored-by: Pengfei Guo <pengfeig@nvidia.com> Co-authored-by: Pengfei Guo <32000655+guopengf@users.noreply.github.com> * Limit the ip address range of inbound ssh to creator's public ip only * Add one FAQ item to describe DNS cache/propagation and how to resolve it * update what's new (NVIDIA#1522) 
* restore set_env.sh (NVIDIA#1513) * Check that requirements are consistent through examples, update doc [skip ci] (NVIDIA#1521) * check that requirements are consistent through examples and add an item to migration notes * one more requirements txt * Doc & Talks updates [skip ci] (NVIDIA#1525) * update what's new * update year * updates to readmes and talks * Add config_type to distinguish (NVIDIA#1526) * Throw exception when connection monitor is not registered (NVIDIA#1520) * Added more detail when recursive data is found in FOBS * Throw exception when no monitor is registered --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Optimize the get_all_clients, move to the training process beginning. (NVIDIA#1524) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * add pengfei to blossim-ci (NVIDIA#1528) * fix job status and speed up fed event end_run (NVIDIA#1523) * fix job status and speed up fedevent end_run * reformat * remove a debug line --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update logging config example (NVIDIA#1530) * Fix a style issue on FAQ about server DNS propagation/caching. * Added Decomposers for HE Classes [skip ci] (NVIDIA#1527) * Added more detail when recursive data is found in FOBS * Added decomposer for CKKSVector * Fixed a supported_type() bug * Black reformat * Black format fix * Renamed he_decomposers to decomposers * When execute has result_error, raise exception instead of simple logging. 
(NVIDIA#1529) * Fix SAG typo (NVIDIA#1536) * Notebooks update [skip ci] (NVIDIA#1541) * repeat POC Setup based on QA inputs * fix typos * fix typos * Add new notebooks and some updates in docs [skip ci] (NVIDIA#1545) * add new notebooks from Kris' DLI and some updates in docs * add image for MONAI * update nvflare version (NVIDIA#1546) * Support direct cell message (NVIDIA#1534) * support direct cell comm * support direct cell msg * improve based on review comments * updated based on review comments --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Add a controller_lock to prevent racing condition (NVIDIA#1537) * Add a controller lock * Fix typo * Add link to on-shot-vfl repo (NVIDIA#1548) * Ha authentication fix (NVIDIA#1535) * Added the missing authentication functions for server job process. * notify the server state change to the running jobs. * codestyle fix. * renamed a logger. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Fixed the shared object issue in the controller task return. (NVIDIA#1549) * Fixed the shared object issue in the controller task return. * codestyle fix. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Remove unneeded cancel all task call (NVIDIA#1540) * Add information about ssh source IP * Fix integration tests (NVIDIA#1492) * update scatter & gather messages (NVIDIA#1552) * Limit the FOBS error log size (NVIDIA#1544) * Added more detail when recursive data is found in FOBS * Limit the size of the log message for FOBS errors --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Enhance job meta validator (NVIDIA#1555) * Enhance job meta validator * Fix typo * use python3 command (NVIDIA#1551) * update comments and exception messages[skip ci] (NVIDIA#1559) * update comments and exception messages * add items cache to executor * Remove manual serialize/deserialize for HE components (NVIDIA#1538) * HE refactoring to rm serialize/deserialize calls * fix simplify he aggregation code * run cifar10 with he * reset processed_algorithms * fix unit test * restore fl_context_utils.py * restore docstring formatting * fix weighted aggregation with HE * move bool flag to constructor * only check for the same process algorithm when accepting * formatting * remove abstract decorators when unnecessary; rename class * remove unused aggregation_weights in config * also introduce process_post_get() filter routine * Fix HE * use HECrossSiteModelEval in monai example * fix x-site val misconfig * use encryption during x-site validation * update warning message * add todos --------- Co-authored-by: YuanTingHsieh <yuantingh@nvidia.com> * Fix cell timing (NVIDIA#1558) * fix cell setup timing * fix cell setup timing * fixed list job * make client_cmd channel messages optional * fix invalid client error --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Restructure docs and notebooks as discussed [skip ci] (NVIDIA#1554) * restructure docs and notebooks as discussed * make updates * fix 
kernel * some more edits --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update monai integration versions [skip_ci] (NVIDIA#1560) * update monai integration versions * update monai integration versions * Enhance preflight check (NVIDIA#1557) * Add preflight check to non-primary server * fix typo * Change optional to required * Fixed -m option in list_jobs [skip ci] (NVIDIA#1556) * Fix integration tests issues (NVIDIA#1562) * Fix incorrect server status after job aborted and server restarted * Updated a re-activate client error message. (NVIDIA#1567) * Early stop on both AWS/Azure when duplicate servers are launched (by design) Add document on this behavior * Fix abort job with only connected clients (NVIDIA#1563) * Fix a typo * update notebooks based on feedback [skip ci] (NVIDIA#1570) * update notebooks based on feedback * minor notebook change * more fixes and add links to notebooks in READMEs * Fix max client in client_manager (NVIDIA#1572) * Update fed policy example (NVIDIA#1575) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Qa issues (NVIDIA#1568) * QA issues. * Refactored. * Removed commented out lines. * Changed to use logger. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * fix a typo in a script (NVIDIA#1577) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Notebooks upgrade [skip ci] (NVIDIA#1574) * 1) change requirements.txt to make it possible to test 2) update POC and hello_world.ipynb * add provision.ipynb * remove outputs * split notebook pre-, post- run scripts * split notebook pre-, post- run scripts * update * update * update fed stats * update wording * update wording * remove clean up directories * fix RESULT_ERROR in FedStats (NVIDIA#1579) * fix RESULT_ERROR * check potential error condition * check potential error condition * check potential error condition * Fix SAG client result error handling (NVIDIA#1571) * update POC and tutorial storage locations [skip ci] (NVIDIA#1580) * update POC and tutorial storage locations * formatting --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Controller no deepcopy (NVIDIA#1565) * Optimize controller not use deepcopy. * codestyle fix. * removed no used import. * Added interval and task_processed in the log message. * reformatted. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Ensure to end the simulator run after client exception. 
(NVIDIA#1582) * Update xgboost path (NVIDIA#1584) * Notebook and documentation fixes [skip ci] (NVIDIA#1581) * notebook and documentation fixes * revise for PR * add link --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update notebook setup_poc [skip_ci] (NVIDIA#1588) * update notebook * update notebook * add notebook links to README.md (NVIDIA#1585) * Update InitializeGlobalWeights workflow to not require clients (NVIDIA#1576) * update InitializeGlobalWeights workflow to not require clients * add type information * fix typo * handle different input args * addition reorganization of the linking for the documentation (NVIDIA#1591) * Fix provision notebook bugs [ski ci] (NVIDIA#1589) * fix bug * fix bug * minor fix to menu (NVIDIA#1594) * Change job_configs to jobs for consistency (NVIDIA#1596) * Add example of fednlp for NER task using BERT model (NVIDIA#1564) * Add nlp example for NER task using BERT model * minor updates * code polish * add data example * update learner for data loading * further refinement on docstring and pad_token * add seqeval licence * modify metric output and custom folder * format * add ner task details * config correction --------- Co-authored-by: Holger Roth <hroth@nvidia.com> * Ignore unknown task result in SAG (NVIDIA#1595) * Cell no executor pool (NVIDIA#1590) * Optimize controller not use deepcopy. * codestyle fix. * removed no used import. * Added interval and task_processed in the log message. * reformatted. * Changes for measure simulator performance. * Cell not use executor pool. * codestyle. * Removed the no use import. * optimized. * refactored. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * test client-side model initialization (NVIDIA#1593) * test client-side model initialization * delete unused file * Fixed cell not been stopped properly when config error. (NVIDIA#1597) * Fixed cell not been stopped properly when config error. * added the exception trace. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * fix bugs and cleanup notebooks (NVIDIA#1598) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Ensure the daemon process to re-start client root process will end if error happens (NVIDIA#1578) * Add job submit success to CI (NVIDIA#1601) * Fix typoe in fuel communicator (NVIDIA#1604) * Fix abort job command return message (NVIDIA#1603) * validate client name type in GlobalWeightsInitializer (NVIDIA#1606) * Revert "Ignore unknown task result in SAG (NVIDIA#1595)" (NVIDIA#1607) This reverts commit 4db55be. * fix workspace bug in notebook [skip ci] (NVIDIA#1605) * fix workspace bug * fix workspace bug * fix workspace bug * fix POC command bug (NVIDIA#1609) * fix workspace bug * fix workspace bug * fix workspace bug * fix workspace bug * fix workspace bug * restore some changes * restore dev branch for now * update split learning readme (NVIDIA#1610) * Re-factor PSI and add user email match to CI (NVIDIA#1583) * Add section on run modes and fix description for list_jobs in notebook (NVIDIA#1600) * add section on run modes * fix link * fix description for list_job in the notebook for the FLARE API --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fix various notebook bugs [skip ci] (NVIDIA#1618) * Fix bugs * update * update * update * fix a bug * Don't submit update with task data from old SSID (NVIDIA#1611) * Don't submit update with task data from old SSID * undo other changes * Use fl context instead of cookie * Job status management enhancement (NVIDIA#1613) * job status enhancement. Added HA mode. * codestyle fix. * Added reviews. 
* Add docstring to executor (NVIDIA#1599) * fix controller dead client handling; added stats pool to_dict (NVIDIA#1617) * fix controller dead client handling; added stats pool to_dict * changed to handle all finished job status * remove unused imports; change to use parse_hist_mode * Make consistent the error message for shutdown_system without auth (NVIDIA#1614) * make consistent the error message for shutdown_system without auth * update command * fix ci * make updates as discussed * fix ci * fix ci * more changes from PR feedback --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * more notebooks bug fixes and updates [slip ci] (NVIDIA#1624) * update * fix notebooks --------- Co-authored-by: Zhihong Zhang <100308595+nvidianz@users.noreply.github.com> * Fixes several shutdown related issues (NVIDIA#1608) * Added more detail when recursive data is found in FOBS * Added exit_func to shutdown communicator * fixed the job status for config error. (NVIDIA#1615) * fixed the job status for config error. * Added FINISHED_ABNORMAL state to indicate the job complete with abnormal complete return code. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fixed job could not run when overseer is offline. (NVIDIA#1625) * Fixed job could not run when overseer is offline. * removed no used import. * Removed the duplicate call. 
* add qat to repo (NVIDIA#1628) * add qat to repo * fix format * remove combo stuff * Removing UDS (NVIDIA#1616) * Added more detail when recursive data is found in FOBS * Removed UDS drivers * Add link and base readme to fed-ce repo [skip ci] (NVIDIA#1623) * Add link and base readme to fed-ce repo * Add link and base readme to fed-ce repo * Add link and base readme to fed-ce repo * add abstracts to fedsm and fedce --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> * Add readme for one-shot VFL paper [skip ci] (NVIDIA#1629) * add readme * update license statement * update the abort_job status after the job complete. (NVIDIA#1627) * Change default initial task fetch interval at client side from 0.1 to 0.5 (NVIDIA#1621) * Add missing parent constructor (NVIDIA#1612) * fix POC stop exception (NVIDIA#1620) * Reduced the non-meaningful logs. (NVIDIA#1630) * Reduced the non-meaningful logs. * Added a space in the log. * Clean up fed stats example (NVIDIA#1602) * Clean up fed stats example * Address comments --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Add default_task_fetch_interval (NVIDIA#1633) * Fixed a save_workspace error. (NVIDIA#1634) * delay the overseer agent start for client job worker process. 
(NVIDIA#1636) * [PSI] add fl_ctx to finalize() and fix bug (NVIDIA#1638) * update README.md * update README.md * update README.md * update README.md * Create index.html * Add fl_ctx to finalize() method * remove extra files --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * Scripts refactoring and notebooks bug fixes/update [skip ci] (NVIDIA#1635) * refactor shutdown_system * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * refactoring and bug fixes * include optional requirements * move the start and shutdown system to api_utils.py * Fix AIO task cancellation and improve abort_job (NVIDIA#1637) * add qat to repo * fix format * remove combo stuff * fix aio task cancellation; improve abort_job cmd * Fix CI (NVIDIA#1639) * update the aborted job status immediately (NVIDIA#1640) * update the aborted job status immediately. * Enhance the shutdown server running job check. * remove the _ensure_daemon_process_shutdown which caused restart fail. 
(NVIDIA#1642) * Correction to xgboost requirements files [skip ci] (NVIDIA#1641) * correction to xgboost requirements files * update xgboost version * Add GPT-2 model (NVIDIA#1626) * add GPT-2 functionality with corrected data loading and alignment * add GPT-2 functionality with corrected data loading and alignment * add GPT-2 functionality with corrected data loading and alignment * remove residue from notebook execution * add creating model message * update model diff computation --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Print job schedule result (NVIDIA#1631) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Do not shut down job runner when server turns to cold state (NVIDIA#1619) * Do not shut down job runner when server turns to cold state * Fix review comments * address comments * use 1 arg instead of 2 args * Fix file license headers (NVIDIA#1643) * Fix header year * Fix issues * Update run test * Add to documentation (NVIDIA#1644) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Use secure logging for exceptions (NVIDIA#1645) * Fixed the server_command_agent AUTHENTICATION_ERROR reply. (NVIDIA#1648) * Update the _turn_to_cold to set to ColdState first. 
(NVIDIA#1649) * Improvement on model diff computation (NVIDIA#1647) * adjust the computation of model diff / update * adjust the computation of model diff / update * adjust the computation of model diff / update --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> * fix description of list_jobs in FLARE API notebook (NVIDIA#1646) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Fix readme typos (NVIDIA#1653) * Change abort_job command to return None (NVIDIA#1650) * add qat to repo * fix format * remove combo stuff * fix aio task cancellation; improve abort_job cmd * change abort_job to return None * do not raise error when closing --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * notebooks tweaks [skip ci] (NVIDIA#1651) * upgrade notebooks * update notebooks * update notebooks * update notebooks * update notebooks --------- Co-authored-by: chesterc <n9Z0GoPp5u1Y> * fix abort_job in old FLAdminAPI (NVIDIA#1657) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update monai integration notebook [skip ci] (NVIDIA#1652) * Update split nn notebook (NVIDIA#1654) * Update xgboost notebooks [skip ci] (NVIDIA#1655) * Update RF notebook (NVIDIA#1656) * Add notebook info [skip ci] (NVIDIA#1658) * add section on notebook setup to docs, clean up index page * add sentence for VDR feedback * Improve example readme [skip ci] (NVIDIA#1659) * Improve example readme * Add install * update readme * Add markdown link check workflow [skip ci] (NVIDIA#1660) (NVIDIA#1661) * Add markdown link check workflow * Fix links * Fix links * Check modified files only * Remove unused file (NVIDIA#1671) * Update RC to real release (NVIDIA#1668) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Cherry pick docs update to 2.3 branch (NVIDIA#1669) * Add markdown link check workflow [skip ci] (NVIDIA#1660) * Add markdown link check workflow * Fix links * Fix links 
* Check modified files only * cherry pick docs update to 2.3 branch --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yuhong Wen <yuhongw@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: nvkevlu <55759229+nvkevlu@users.noreply.github.com> Co-authored-by: Zhihong Zhang <100308595+nvidianz@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Pengfei Guo <pengfeig@nvidia.com> Co-authored-by: Pengfei Guo <32000655+guopengf@users.noreply.github.com>

Add hello-monai and minor fixes

* rename config related classes * add client api example * fix metric streaming * add to() routine

* WIP: constructed the FedJob. * WIP: server_app json export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as parameter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed unused import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>

* Implement federated logistic regression with second-order newton raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (NVIDIA#2480) * Added more logging for the job status changing. * Fixed a logging call error. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (NVIDIA#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (NVIDIA#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (NVIDIA#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (NVIDIA#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (NVIDIA#2521) Upgrade dependencies (NVIDIA#2516) Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517) Multiple bug fixes from 2.4 (NVIDIA#2518) * [2.4] Support client custom code in simulator (NVIDIA#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (NVIDIA#2457) * Fix sub_worker_process shutdown (NVIDIA#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474) Pythonic job creation (NVIDIA#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. 
* Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. 
--------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (NVIDIA#2519) * Starts heartbeat after task is pull and before task execution (NVIDIA#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442) * [2.4] Improve cell pipe timeout handling (NVIDIA#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (NVIDIA#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (NVIDIA#2478) * Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495) * Fix metric relay pipe handler timeout (NVIDIA#2496) * Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (NVIDIA#2520) * Update github actions (NVIDIA#2450) * Fix premerge (NVIDIA#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (NVIDIA#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (NVIDIA#2522) Simulator workspace re-design (NVIDIA#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. 
* Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (NVIDIA#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (NVIDIA#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) Fix README. Fix file links in README. Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (NVIDIA#2535) Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. 
Replace the implementation of in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that the user script define a main() function 3. redirect print() to logger.info() * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace the implementation of in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that the user script define a main() function 3. redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions. expose aggregate_fn to users for overwriting (NVIDIA#2539) Fix MLflow and TensorBoard output to be consistent with new workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard b) avoids job output overwriting the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard b) avoids job output overwriting the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard b) avoids job output overwriting the 1st job * 1. Remove the default code to use configuration 2. fix some broken notebooks * rollback changes Fix decorator issue (NVIDIA#2542) Remove line number in code link. 
FLModel summary (NVIDIA#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com>

generate the job app config. fully functional pythonic job creation. Added simulator_run for pythonic API. reformat. Added filters support for pythonic job creation. handled the direct import case in fed_job. refactor. Added the resource_spec set function for FedJob. refactored. Moved the ClientApp and ServerApp into fed_app.py. Refactored: removed the _FilterDef class. refactored. Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine Enable obj in the constructor as paramenter. Added support for the launcher script. refactored. reformat. Update the comment. re-arrange the package location. Added add_ext_script() for BaseAppConfig. codestyle fix. Removed the client-api-pt example. removed no used import. fixed the in_time_accumulate_weighted_aggregator_test.py Added Enum parameter support. Added docstring. Fix typo (NVIDIA#2432) Enable StreamCell for all application channels (NVIDIA#2407) Add back request header (NVIDIA#2440) Check wandb login (NVIDIA#2445) * check wandb login * Use default wandb offline mode * add mode online check Add note about delay in workspace creation for larger jobs (NVIDIA#2454) Client API Update: Job Templates, examples to reflect different type of Client API (NVIDIA#2456) * 1. Update README 2. fix bugs on in-proc client API 3. update examples to use in-proc client api in cases make sense * 1. update documentation * 1. update job template description 2. update in process API to allow user keep the existing configuration 3. update notebooks for step-by-step sag * update README.md * remove task_fn_args argument in the executor * remove task_fn_args argument in the executor add controller interface (NVIDIA#2451) Update README.md (NVIDIA#2460) fix typo improve reliable msg (NVIDIA#2459) CC block byoc jobs (NVIDIA#2403) * WIP: tdx_cc integration. * fixed toke_file read. 
* WIP: added info for CC add client tokens.: * Fixed an error when client does not have CC token reported. * Added handle for client does not have CC_INFO. * Added CLIENT_QUIT event for CCManager to remove client token. * Added _add_client_token client token logging info. * Added peer_ctx for client quit. * set_peer_context for client quit. * Changed the AUTHORIZATION_REASON set_prop sticky to False. * WIP: TokenPundit interface change. * WIP: added cc_authorizer_ids config. * Added cc_issuer_id for CCManager. * renamed the TokenPundit to CCAutorizer. * Added CC token adding through client heartbeat. * Added function to stop current running job if CC verify fail. * if CC failed to get toke, don't allow the system to start. * Added exceptions None check. * Address the client side CC check before job scheduled. * fixed the PEER_FL_CONTEXT error. * Added CCManager support to have multiple cc_issuers. * optimized CCManager. * updated the _verify_participants() logic. * set up the proper fl_ctx for admin send_requests(). * Add proper fl_ctx. * Refactor the CCManager. * Refactor the CCManager and TDX_authorizer. * Added TOKEN_EXPIRATION for each cc_issue in CCManager. * Fixed CC TOKEN_EXPIRATION error. * refactor the CCManager _prepare_cc_info() * Refactor. * refactor the cc tokens periodic verification. * added critical_level for CCManager. * codestyle fix. * removed no used import. * removed no use import. * Fixed the unitest. * Added CCManager unit tests. * Added CCTokenGenerateError and CCTokenVerifyError. Updated CCAuthorizer interface. * WIP: CC block byoc job. * block BYOC job for CC. * Addressed some PR reviews. * Added exception catch for TDXAuthorizer. * codestyle fix. * renamed some events. * renamed event names. * renamed event names. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Fixed the authz and site_security check for check_resource command. 
(NVIDIA#2462) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> add garbage collect at ends of round-based workflows (NVIDIA#2463) add WFController (NVIDIA#2468) Add warning when the same admin in project.yml has different role Add custom order and early termination to CyclicController (NVIDIA#2387) * Add custom order and early termination to CyclicController and add tests * Add more error handling Add IPC agent and exchanger (NVIDIA#2435) * support av ipc agent * removed unused import * address PR comments fix typo (NVIDIA#2473) Refactor WFController and ModelController (NVIDIA#2475) * refactor wf and model controller * clarify persisor_id Add example for mulitparty kaplan-meier analysis with HE (NVIDIA#2259) * add example for mulitparty kaplan meier analysis with HE * update requirements * update baseline script, remove complex settings and keep basic only * add readme with details * add readme with details * add curves, modify saving functions (curve and km details) * job name update * remove redundant print * move data preparation part out of local code * move HE context part out of FL process to better accomodate the transition to real application * update to use new controller interface * change to send_model_and_wait * format * updated readme * fix merge conflict * update readme * update readme * update readme * update readme * move to job template --------- Co-authored-by: Sean Yang <seany314@gmail.com> remove old task_fn_args (NVIDIA#2479) Enable simulator to run HE (NVIDIA#2339) * Enable simulator to run HE. * fixed the unittest. * Created startup folder for simulator run if not exist. * Changed to use setup and teardown for pytest. * extract common codes init_security_content_service(). * removed no use import. 
--------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> not creating Workspace object (NVIDIA#2489) Fix xgboost integration tests (NVIDIA#2486) * change to use path * update finance and vertical xgboost Added ability to handle parameters from base class. Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. Added params_exchange_format for PTInProcessClientAPIExecutor. codestyle fix. Fixed a custom code folder structure issue. work for sub-folder custom files. backed to handle parameters from base classes. Support folder structure job config. Added support for flat folder from '.XXX' import. codestyle fix. refactored and add docstring. Add FedBPT research example (NVIDIA#2465) * Add FedBPT research example initial fedbpt files add roberta model and run FL move send to end upgrade to 2.4.1rc and run experiment with 10 clients move init to top debug using pickle record successful setting use custom decomposer clean code add summary writer add result figure formatting fix broken links remove debug messages update readme with system resources use decomposer widget on server * address comments; enable selection of evaluation client * use new FedAvg api * exclude dir from license test * only exclude file for license check fix xgboost test setup (NVIDIA#2494) add Client API documentation (NVIDIA#2497) * add Client API documentation * add Client API documentation Added more logging for the job status changing. (NVIDIA#2480) * Added more logging for the job status changing. * Fixed a logging call error. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (NVIDIA#2508) * check workflow id before updating client status * change order of checks Address some of the PR reviews. 
Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine run demo run demo set gpus and external scripts move FedJob api change folder structure xval example xval example reuse code add filter example minor updates update job dir refactor Controller/ExcecutorApps hide ControllerApp/ExecutorApp fix doubled deploy call handle filters handle cross-site val add swarm example (wip) Add user guide on how to deploy to EKS (NVIDIA#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (NVIDIA#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (NVIDIA#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (NVIDIA#2521) Upgrade dependencies (NVIDIA#2516) Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517) Multiple bug fixes from 2.4 (NVIDIA#2518) * [2.4] Support client custom code in simulator (NVIDIA#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (NVIDIA#2457) * Fix sub_worker_process shutdown (NVIDIA#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474) Pythonic job creation (NVIDIA#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. 
* Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. 
--------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (NVIDIA#2519) * Starts heartbeat after task is pull and before task execution (NVIDIA#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442) * [2.4] Improve cell pipe timeout handling (NVIDIA#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (NVIDIA#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (NVIDIA#2478) * Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495) * Fix metric relay pipe handler timeout (NVIDIA#2496) * Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (NVIDIA#2520) * Update github actions (NVIDIA#2450) * Fix premerge (NVIDIA#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (NVIDIA#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> WIP: constructed the FedJob. WIP: server_app josn export. generate the job app config. fully functional pythonic job creation. Added simulator_run for pythonic API. reformat. Added filters support for pythonic job creation. handled the direct import case in fed_job. 
refactor. Added the resource_spec set function for FedJob. refactored. Moved the ClientApp and ServerApp into fed_app.py. Refactored: removed the _FilterDef class. refactored. Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine Enable obj in the constructor as parameter. Added support for the launcher script. refactored. reformat. Update the comment. re-arrange the package location. Added add_ext_script() for BaseAppConfig. codestyle fix. Removed the client-api-pt example. Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine run demo set gpus and external scripts move FedJob api change folder structure xval example reuse code add filter example minor updates update job dir refactor Controller/ExecutorApps hide ControllerApp/ExecutorApp fix doubled deploy call handle filters handle cross-site val add swarm example (wip) make FedJob2 default FedJob use ScriptExecutor test swarm learning add cyclic workflow add todo update swarm learning make FedJob2 default again use controller name for stats (NVIDIA#2522) Simulator workspace re-design (NVIDIA#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspace structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimized. * adjust the stats pool location. * Addressed the PR reviews. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (NVIDIA#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed an add_argument type, added help message. * Changed to use add_argument() compatible with Python 3.8. * reformat.
* rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (NVIDIA#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply use ScriptExecutor add kmeans example simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) test kmeans, use latest main fix kmeans some redesign address comments rename source dir Add missing client api test jobs (NVIDIA#2535) Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner 2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that the user script define main() 3.
redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo remove use of uuid4 handle ids of built-in components expose aggregate_fn to users for overwriting (NVIDIA#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later jobs overwriting the first job's output * 1. Remove the default code to use configuration 2.
fix some broken notebook * rollback changes Fix decorator issue (NVIDIA#2542) FLModel summary (NVIDIA#2544) * add FLModel Summary * format Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> handle cases where the script with relative path in Script Runner (NVIDIA#2543) * handle cases where the script with relative path * handle cases where the script with relative path * add more unit test cases and change the file search logics * code format * add more unit test cases and change the file search logics Lr newton raphson (NVIDIA#2529) * Implement federated logistic regression with second-order newton raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (NVIDIA#2480) * Added more logging for the job status changing. * Fixed a logging call error. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (NVIDIA#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (NVIDIA#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (NVIDIA#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (NVIDIA#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (NVIDIA#2521) Upgrade dependencies (NVIDIA#2516) Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517) Multiple bug fixes from 2.4 (NVIDIA#2518) * [2.4] Support client custom code in simulator (NVIDIA#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (NVIDIA#2457) * Fix sub_worker_process shutdown (NVIDIA#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474) Pythonic job creation (NVIDIA#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. 
* re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (NVIDIA#2519) * Starts heartbeat after task is pull and before task execution (NVIDIA#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442) * [2.4] Improve cell pipe timeout handling (NVIDIA#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (NVIDIA#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (NVIDIA#2478) * Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495) * Fix metric relay pipe handler timeout (NVIDIA#2496) * Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502) Co-authored-by: 
Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (NVIDIA#2520) * Update github actions (NVIDIA#2450) * Fix premerge (NVIDIA#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (NVIDIA#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (NVIDIA#2522) Simulator workspace re-design (NVIDIA#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (NVIDIA#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (NVIDIA#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) Fix README. Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (NVIDIA#2535) Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner 2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that the user script define main() 3. Redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions.
expose aggregate_fn to users for overwriting (NVIDIA#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later jobs overwriting the first job's output * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes Fix decorator issue (NVIDIA#2542) Remove line number in code link.
FLModel summary (NVIDIA#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com> handle ids minor updates rename folder use default ids update kmeans add lightning example handle multiple GPUs make model selection metric configurable add docstrings Add information about dig (bind9-dnsutils) in the document Update monai readme to remove logging.conf (NVIDIA#2552) MONAI mednist example (NVIDIA#2532) * add monai notebook * add training script * update example * update notebook * use job template * call init later * switch back * add gitignore * update notebooks * add readmes * send received model to GPU * use monai tb stats handler * formatting Improve AWS cloud launch script restore files reset file. Add docstring formatting
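
The "Improve InProcessClientAPIExecutor (NVIDIA#2536)" entry above describes two behaviours: running the user script through runpy.run_path() so that it need not define main(), and redirecting print() to logger.info(). The following is a minimal standard-library sketch of that idea, not the NVFlare implementation; the helper name `run_script_logging_prints` and its arguments are hypothetical.

```python
import builtins
import logging
import runpy


def run_script_logging_prints(script_path, logger):
    """Run script_path as __main__, sending its print() output to logger.info().

    Hypothetical helper for illustration only; this is not NVFlare's
    TaskScriptRunner.
    """
    original_print = builtins.print

    def print_to_logger(*args, sep=" ", **_ignored):
        # Join the arguments roughly the way print() would, then log one record.
        sep = " " if sep is None else sep
        logger.info(sep.join(str(arg) for arg in args))

    builtins.print = print_to_logger
    try:
        # run_path executes the file's top-level code, so the script does not
        # need to define a main() function. It returns the script's globals.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        # Always restore the real print(), even if the script raises.
        builtins.print = original_print
```

Because the sketch swaps `builtins.print` for the duration of the call, it affects every thread in the process while the script runs; a real executor may need a narrower mechanism.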

…ormalization federated learning method (NVIDIA#2524) * add research/fedbn * delete redundant controller and correct figs requirements * update plot_requirements * rewrite fedbn * update jobs * remove workspace * update README * simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) * Add missing client api test jobs (NVIDIA#2535) * Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner 2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that the user script define main() 3. Redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo * FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns.
This a) is consistent with TensorBoard and b) avoids later jobs overwriting the first job's output * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes * Fix decorator issue (NVIDIA#2542) * update create and run job script * FLModel summary (NVIDIA#2544) * add FLModel Summary * format * remove jobs folder * expose aggregate_fn to users for overwriting (NVIDIA#2539) * handle cases where the script with relative path in Script Runner (NVIDIA#2543) * handle cases where the script has a relative path * add more unit test cases and change the file search logic * code format * add more unit test cases and change the file search logic * Lr newton raphson (NVIDIA#2529) * Implement federated logistic regression with second-order Newton-Raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (NVIDIA#2480) * Added more logging for the job status changing. * Fixed a logging call error.
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (NVIDIA#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (NVIDIA#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (NVIDIA#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (NVIDIA#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (NVIDIA#2521) Upgrade dependencies (NVIDIA#2516) Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517) Multiple bug fixes from 2.4 (NVIDIA#2518) * [2.4] Support client custom code in simulator (NVIDIA#2447) * Support client custom code in simulator * Fix client custom code * Remove cancel_futures args (NVIDIA#2457) * Fix sub_worker_process shutdown (NVIDIA#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474) Pythonic job creation (NVIDIA#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. 
* re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. --------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (NVIDIA#2519) * Starts heartbeat after task is pull and before task execution (NVIDIA#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442) * [2.4] Improve cell pipe timeout handling (NVIDIA#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (NVIDIA#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (NVIDIA#2478) * Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495) * Fix metric relay pipe handler timeout (NVIDIA#2496) * Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502) Co-authored-by: 
Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (NVIDIA#2520) * Update github actions (NVIDIA#2450) * Fix premerge (NVIDIA#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (NVIDIA#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (NVIDIA#2522) Simulator workspace re-design (NVIDIA#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. * Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (NVIDIA#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. 
--------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (NVIDIA#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) Fix README. Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (NVIDIA#2535) Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. Rename the ExeTaskFnWrapper class to TaskScriptRunner 2. Replace the in-process function execution: instead of calling a main() function, use runpy.run_path(), which removes the requirement that the user script define main() 3. Redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions.
expose aggregate_fn to users for overwriting (NVIDIA#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLflow, default tracking_uri to <workspace_dir>/<job_id>/mlruns instead of the current default <workspace_dir>/mlruns. This a) is consistent with TensorBoard and b) avoids later jobs overwriting the first job's output * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes Fix decorator issue (NVIDIA#2542) Remove line number in code link.
FLModel summary (NVIDIA#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com> * Add information about dig (bind9-dnsutils) in the document * format update * Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update monai readme to remove logging.conf (NVIDIA#2552) * MONAI mednist example (NVIDIA#2532) * add monai notebook * add training script * update example * update notebook * use job template * call init later * swith back * add gitignore * update notebooks * add readmes * send received model to GPU * use monai tb stats handler * formatting * Improve AWS cloud launch script * Add in process client api tests (NVIDIA#2549) * Add in process client api tests * Fix headers * Fix comments * Add client controller executor (NVIDIA#2530) * add client controller executor * address comments * enhance abort, set peer props * remove asserts --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add option in dashboard cli for AWS vpc and subnet * add note on README visualization * update README * update readme * update readme * update readme * [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573) * clean up to allow creation of nvflare light * move defs to cellnet * Enable patch and build for nvflight (NVIDIA#2574) * verified commit --------- Co-authored-by: Yuhong Wen <yuhongw@nvidia.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) 
<yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Zhijin <zhijinl@nvidia.com> Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Ziyue Xu <ziyue.xu@gmail.com> Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com>

* Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Implemented horizontal calls in nvflare plugin * Added support for horizontal secure XGBoost * Fixed a few horizontal issues * Added reliable message * Added ReliableMessage parameters * Added log for debugging empty rcv_buf * Added finally block to finish duplicate seq * Removed debug statements * format change * Add in process client api tests (NVIDIA#2549) * Add in process client api tests * Fix headers * Fix comments * Add client controller executor (NVIDIA#2530) * add client controller executor * address comments * enhance abort, set peer props * remove asserts --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add option in dashboard cli for AWS vpc and subnet * [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573) * clean up to allow creation of nvflare light * move defs to cellnet * Enable patch and build for nvflight (NVIDIA#2574) * add FedBN Implementation on NVFlare research folder - a local batch normalization federated learning method (NVIDIA#2524) * add research/fedbn * delete redudant controller and correct figs requirements * update plot_requirements * rewrite fedbn * update jobs * remove workspace * update README * simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) * Add missing client api test jobs (NVIDIA#2535) * Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * 1. 
rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo * FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1. Remove the default code to use configuration 2. 
fix some broken notebook * rollback changes * Fix decorator issue (NVIDIA#2542) * update create and run job script * FLModel summary (NVIDIA#2544) * add FLModel Summary * format * remove jobs folder * expose aggregate_fn to users for overwriting (NVIDIA#2539) * handle cases where the script with relative path in Script Runner (NVIDIA#2543) * handle cases where the script with relative path * handle cases where the script with relative path * add more unit test cases and change the file search logics * code format * add more unit test cases and change the file search logics * Lr newton raphson (NVIDIA#2529) * Implement federated logistic regression with second-order newton raphson. Update file headers. Update README. Update README. Fix README. Refine README. Update README. Added more logging for the job status changing. (NVIDIA#2480) * Added more logging for the job status changing. * Fixed a logging call error. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Fix update client status (NVIDIA#2508) * check workflow id before updating client status * change order of checks Add user guide on how to deploy to EKS (NVIDIA#2510) * Add user guide on how to deploy to EKS * Address comments Improve dead client handling (NVIDIA#2506) * dev * test dead client cmd * added more info for dead client tracing * remove unused imports * fix unit test * fix test case * address PR comments --------- Co-authored-by: Sean Yang <seany314@gmail.com> Enhance WFController (NVIDIA#2505) * set flmodel variables in basefedavg * make round info optional, fix inproc api bug temporarily disable preflight tests (NVIDIA#2521) Upgrade dependencies (NVIDIA#2516) Use full path for PSI components (NVIDIA#2437) (NVIDIA#2517) Multiple bug fixes from 2.4 (NVIDIA#2518) * [2.4] Support client custom code in simulator (NVIDIA#2447) * Support client custom code in simulator * Fix client custom code * Remove 
cancel_futures args (NVIDIA#2457) * Fix sub_worker_process shutdown (NVIDIA#2458) * Set GRPC_ENABLE_FORK_SUPPORT to False (NVIDIA#2474) Pythonic job creation (NVIDIA#2483) * WIP: constructed the FedJob. * WIP: server_app josn export. * generate the job app config. * fully functional pythonic job creation. * Added simulator_run for pythonic API. * reformat. * Added filters support for pythonic job creation. * handled the direct import case in fed_job. * refactor. * Added the resource_spec set function for FedJob. * refactored. * Moved the ClientApp and ServerApp into fed_app.py. * Refactored: removed the _FilterDef class. * refactored. * Rename job config classes (#3) * rename config related classes * add client api example * fix metric streaming * add to() routine * Enable obj in the constructor as paramenter. * Added support for the launcher script. * refactored. * reformat. * Update the comment. * re-arrange the package location. * Added add_ext_script() for BaseAppConfig. * codestyle fix. * Removed the client-api-pt example. * removed no used import. * fixed the in_time_accumulate_weighted_aggregator_test.py * Added Enum parameter support. * Added docstring. * Added ability to handle parameters from base class. * Move the parameter data format conversion to the START_RUN event for InProcessClientAPIExecutor. * Added params_exchange_format for PTInProcessClientAPIExecutor. * codestyle fix. * Fixed a custom code folder structure issue. * work for sub-folder custom files. * backed to handle parameters from base classes. * Support folder structure job config. * Added support for flat folder from '.XXX' import. * codestyle fix. * refactored and add docstring. * Address some of the PR reviews. 
--------- Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Enhancements from 2.4 (NVIDIA#2519) * Starts heartbeat after task is pull and before task execution (NVIDIA#2415) * Starts pipe handler heartbeat send/check after task is pull before task execution (NVIDIA#2442) * [2.4] Improve cell pipe timeout handling (NVIDIA#2441) * improve cell pipe timeout handling * improved end and abort handling * improve timeout handling --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * [2.4] Enhance launcher executor (NVIDIA#2433) * Update LauncherExecutor logs and execution setup timeout * Change name * [2.4] Fire and forget for pipe handler control messages (NVIDIA#2413) * Fire and forget for pipe handler control messages * Add default timeout value * fix wait-for-reply (NVIDIA#2478) * Fix pipe handler timeout in task exchanger and launcher executor (NVIDIA#2495) * Fix metric relay pipe handler timeout (NVIDIA#2496) * Rely on launcher check_run_status to pause/resume hb (NVIDIA#2502) Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> --------- Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Update ci cd from 2.4 (NVIDIA#2520) * Update github actions (NVIDIA#2450) * Fix premerge (NVIDIA#2467) * Fix issues on hello-world TF2 notebook * Fix tf integration test (NVIDIA#2504) * Add client api integration tests --------- Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> use controller name for stats (NVIDIA#2522) Simulator workspace re-design (NVIDIA#2492) * Redesign simulator workspace structure. * working, needs clean. * Changed the simulator workspacce structure to be consistent with POC. 
* Moved the logfile init to start_server_app(). * optimzed. * adjust the stats pool location. * Addressed the PR views. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Simulator end run for all clients (NVIDIA#2514) * Provide an option to run END_RUN for all clients. * Added end_run_all option for simulator to run END_RUN event for all clients. * Fixed a add_argument type, added help message. * Changed to use add_argument(() compatible with python 3.8. * reformat. * rewrite the _end_run_clients() and add docstring for easier understanding. * reformat. * adjusting the locking in the _end_run_clients. * Fixed a potential None pointer error. * renamed the clients_finished_end_run variable. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Secure XGBoost Integration (NVIDIA#2512) * Updated FOBS readme to add DatumManager, added agrpcs as secure scheme * Refactoring * Refactored the secure version to histogram_based_v2 * Replaced Paillier with a mock encryptor * Added license header * Put mock back * Added metrics_writer back and fixed GRPC error reply simplify job simulator_run to take only one workspace parameter. (NVIDIA#2528) Fix README. Fix file links in README. Fix file links in README. Add comparison between centralized and federated training code. Add missing client api test jobs (NVIDIA#2535) Fixed the simulator server workspace root dir (NVIDIA#2533) * Fixed the simulator server root dir error. * Added unit test for SimulatorRunner start_server_app. --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Improve InProcessClientAPIExecutor (NVIDIA#2536) * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. 
Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * 1. rename ExeTaskFnWrapper class to TaskScriptRunner 2. Replace implementation of the inprocess function exection from calling a main() function to user runpy.run_path() which reduce the user requirements to have main() function 3. redirect print() to logger.info() * make result check and result pull use the same configurable variable * rename exec_task_fn_wrapper to task_script_runner.py * fix typo Update README for launching python script. Modify tensorboard logdir. Link to environment setup instructions. expose aggregate_fn to users for overwriting (NVIDIA#2539) FIX MLFLow and Tensorboard Output to be consistent with new Workspace root changes (NVIDIA#2537) * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1) fix mlruns and tb_events dirs due to workspace directory changes 2) for MLFLow, add tracking_rui default to workspace_dir / <job_id>/mlruns instead current default <workspace_dir>/mlruns. This is a) consistent with Tensorboard 2) avoid job output oeverwrite the 1st job * 1. Remove the default code to use configuration 2. fix some broken notebook * rollback changes Fix decorator issue (NVIDIA#2542) Remove line number in code link. 
FLModel summary (NVIDIA#2544) * add FLModel Summary * format formatting Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * update license --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Holger Roth <hroth@nvidia.com> * Add information about dig (bind9-dnsutils) in the document * format update * Update KM example, add 2-stage solution without HE (NVIDIA#2541) * add KM without HE, update everything * fix license header * fix license header - update year to 2024 * fix format --------- Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> * Update monai readme to remove logging.conf (NVIDIA#2552) * MONAI mednist example (NVIDIA#2532) * add monai notebook * add training script * update example * update notebook * use job template * call init later * swith back * add gitignore * update notebooks * add readmes * send received model to GPU * use monai tb stats handler * formatting * Improve AWS cloud launch script * Add in process client api tests (NVIDIA#2549) * Add in process client api tests * Fix headers * Fix comments * Add client controller executor (NVIDIA#2530) * add client controller executor * address comments * enhance abort, set peer props * remove asserts --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * Add option in dashboard cli for AWS vpc and subnet * add note on README visualization * update README * update readme * update readme * update readme * [2.5] Clean up to allow creation of nvflare light (NVIDIA#2573) * clean up to allow creation of nvflare light * move defs to cellnet * Enable patch and build for nvflight (NVIDIA#2574) * verified commit --------- Co-authored-by: Yuhong Wen <yuhongw@nvidia.com> Co-authored-by: Yuan-Ting Hsieh (謝沅廷) 
<yuantingh@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Zhijin <zhijinl@nvidia.com> Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Ziyue Xu <ziyue.xu@gmail.com> Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com> Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> * fix MLFLOW example (NVIDIA#2575) Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * BugFix: InProcessClientAPIExecutor's TaskScriptRunner (NVIDIA#2558) * 1) find script full path to indicate which site script to avoid loading run script 2) make sure the task script failed will cause the client to return failure status which will trigger job stop rather wait forever 3) add different unit tests * sort key in unit test * add logic to improve error message * style format * add more tests and logics * code format * code format * fix steps error * fix global steps * rollback some changes and split it into another PR * rollback some changes and split it into another PR --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * update client_api.png (NVIDIA#2577) * Fix the simulator worker sys path (NVIDIA#2561) * Fixed the simulator worker sys path. * fixed the get_new_sys_path() logic, added in unit test. * fixed isort. * Changed the _get_new_sys_path() implementation. --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> * ReliableMessage register is changed to register aux message. 
Added support for Mac with vertical --------- Co-authored-by: Yuan-Ting Hsieh (謝沅廷) <yuantingh@nvidia.com> Co-authored-by: Sean Yang <seany314@gmail.com> Co-authored-by: Isaac Yang <isaacy@nvidia.com> Co-authored-by: Yan Cheng <58191769+yanchengnv@users.noreply.github.com> Co-authored-by: Minghui Chen <50226876+MinghuiChen43@users.noreply.github.com> Co-authored-by: Yuhong Wen <yuhongw@nvidia.com> Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Zhijin <zhijinl@nvidia.com> Co-authored-by: Holger Roth <hroth@nvidia.com> Co-authored-by: Ziyue Xu <ziyue.xu@gmail.com> Co-authored-by: Ziyue Xu <71786575+ZiyueXu77@users.noreply.github.com> Co-authored-by: Holger Roth <6304754+holgerroth@users.noreply.github.com>

holgerroth added 3 commits November 8, 2022 10:58

write initial valid metrics from t2 clients to tensorboard

d8c6afa

fix _before_accept event type

4bb1edb

update reset_starts event

e42975a

holgerroth merged commit 5431ec4 into flhub_v2 Nov 8, 2022

holgerroth pushed a commit that referenced this pull request Dec 4, 2023

Merge pull request #3 from IsaacYangSLA/hello-monai

03070e5

Add hello-monai and minor fixes

holgerroth added a commit that referenced this pull request Apr 10, 2024

Rename job config classes (#3)

5311fc7

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 11, 2024

Rename job config classes (#3)

845dbf8

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 12, 2024

Rename job config classes (#3)

3450041

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 12, 2024

Rename job config classes (#3)

e0e74b7

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 15, 2024

Rename job config classes (#3)

1e339ca

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 17, 2024

Rename job config classes (#3)

aded58c

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 18, 2024

Rename job config classes (#3)

2278e80

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 18, 2024

Rename job config classes (#3)

74ed9f8

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 19, 2024

Rename job config classes (#3)

1b8d719

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth added a commit that referenced this pull request Apr 19, 2024

Rename job config classes (#3)

a518e46

* rename config related classes * add client api example * fix metric streaming * add to() routine

holgerroth commented Nov 8, 2022
