v0.4.1: Ray training, Ray datasets, experimental AutoML with auto config generation integrated with hyperopt on RayTune, image improvements, Python3.9/TF2.7
Summary
This release features experimental AutoML with auto config generation and auto-training integrated with hyperopt on RayTune, and integrations with Ray training and Ray datasets. We're still working on a comprehensive overhaul of the documentation, and all the new functionality will also be available in the upcoming v0.5.
Aside from critical bug fixes and new datasets, v0.4.1 will be the last release of Ludwig using TensorFlow. Starting with v0.5 (release coming soon), Ludwig will use PyTorch as the backend for tensor computation. We will release a blog post detailing the rationale and impact of this decision. We wanted to do one last TensorFlow release so that everyone committed to the TensorFlow ecosystem who has used Ludwig so far can benefit from the many bug fixes and improvements we made to the codebase that are not specific to PyTorch.
The next version v0.5 will also have several additional improvements that we’ll be excited to share in the coming weeks.
Additions
- Non-absolute image path support by @hungcs in #1224
- Add image dim inference to schema by @hungcs in #1225
- Additional Tabular Datasets by @amholler (#1226, #1230, #1237)
- Initial implementation of the end-to-end autotrain module by @ANarayan in #1219
- [automl] AutoML Extended public API by @tgaddair in #1235
- Add image dimension inference to automl by @hungcs in #1243
- [automl] Memory Aware Config Tuning by @ANarayan in #1257
- Added DataFrame wrapper type and fixed usage of optional imports by @tgaddair in #1371
- Added Dask kwargs to Ray backend by @tgaddair in #1380
- Configure Dask to determine parallelism automatically by default by @tgaddair in #1383
- Add Ray backend to Ray hyperopt by @Yard1 in #1269
- Add additional hyperopt callbacks by @hungcs in #1388
- Added preprocessing callbacks by @tgaddair in #1398
- Added Slack and Twitter badges by @tgaddair in #1399
- Add support for Ray Train and Ray Datasets in training by @tgaddair in #1391
- Add combiner schema validation by @ksbrar in #1347
- Publish unit test results by @tgaddair in #1414
- Publish test results for fork repos as well by @EnricoMi in #1442
- Build docker images for tf-legacy by @tgaddair in #1504
- Added init_config and render_config command-line utils (#1506) by @tgaddair in #1514
- Add experiment heuristics to automl module (variant of Avanika PR 1362) by @amholler in #1507
- Add random_seed to auto_train API to improve repeatability by @amholler in #1619
- Support use_reference_config option to AutoML to add initial trial from relevant best past model by @amholler in #1636
- Add remote checkpoint support to ray tune post search evaluation by @amholler in #1646
- [datasets] Add remote filesystem support to datasets module by @ANarayan in #1244
- Add sample training by @amholler in #1227
- Add support for Santander Customer Satisfaction dataset, along with s… by @amholler in #1238
Improvements
- Allow logging params to mlflow from any epoch by @tgaddair in #1211
- Changed remote fs behavior to upload at the end of each epoch by @tgaddair in #1210
- Add metric and loss modules for RMSE, RMSPE, and AUC by @ANarayan in #1214
- [hyperopt] fixed metric_score to use test split when available by @tgaddair in #1239
- Fixed metric selection to ignore config split if unavailable by @tgaddair in #1248
- Ray Tune Intermediate Checkpoint Cleaning by @ANarayan in #1255
- Do not initialize Ray if already initialized by @Yard1 in #1277
- Changed default combiner to concat from tabnet by @ShreyaR in #1278
- Ray data migration by @ShreyaR in #1260
- Fix automl to treat binary as categorical when missing values present by @tgaddair in #1292
- Add serialization for DatasetInfo and round avg_words to int by @hungcs in #1294
- Cast max_length to int in build_sequence_matrix::pad by @Yard1 in #1295
- [automl] update model config parameter ranges by @ANarayan in #1298
- Change INFER_IMAGE_DIMENSIONS default to True by @hungcs in #1303
- Add HTTPS retries for image urls by @hungcs in #1304
- Return None for unreadable images and try to infer num channels by @hungcs in #1307
- Add gray image/avg image fallbacks for unreachable images by @hungcs in #1312
- Account for image extensions during image type inference by @hungcs in #1335
- Fixed schema validation to handle null preprocessing values for strings by @tgaddair in #1344
- Added default size and output_size for tabnet by @tgaddair in #1355
- Removed DaskBackend and moved tests to RayBackend by @tgaddair in #1412
- Perform preprocessing first before hyperopt when possible by @tgaddair in #1415
- Employ a fallback str2bool mapping from the feature column's distinct values when the feature's values aren't boolean-like by @justinxzhao in #1471
- Remove trailing dot in income label field in adult_census… by @amholler in #1475
- Update Ludwig AutoML Feature Type Selection by @amholler in #1485
- Update infer_type tests to reflect interface and functionality updates by @amholler in #1493
- Skip converting to TensorDType if the column is binary by @tgaddair in #1547
- Remove TensorDType conversion for all scalar types by @tgaddair in #1560
- Update AutoML tabular model type choice to remove heuristic for concat by @amholler in #1548
- Better handle empty fields with distinct_values=[] by @hungcs in #1574
- Port #1476 ('dict' option for weights_initializer and bias_initializer) to tf_legacy by @ksbrar in #1599
- Modify combiners to accept input_features as a dict instead of a list by @jeffreyftang in #1618
- Update hyperopt: Choose best model from validation data; For stopped Ray Tune trials, run evaluate at search end by @amholler in #1612
- Keep search_alg type in dict to record in hyperopt_statistics.json by @amholler in #1626
- For ames_housing, remove test.csv from processing; it has no label column which prevents test split eval by @amholler in #1634
- Improve Ludwig resilience to Ray Tune issues by @amholler in #1660
- Handle download gzip files by @amholler in #1676
- Upgrade tf from 2.5.2 to 2.7.0 by @justinxzhao in #1713
- Add basic precommit to tf-legacy to pass precommit checks on tf-legacy PRs by @justinxzhao in #1718
- For kdd datasets, do not include unlabeled test data by default by @amholler in #1704
- Use config which has been previously validated by @vreyespue in #1213
- Update Readme to activate directly the virtualenv by @vreyespue in #1212
- doc: Correct README.md link to Developer Guide by @jimthompson5802 in #1217
- Update pandas version by @w4nderlust in #1223
- Modify Kaggle datasets to not process test sets by @ANarayan in #1233
- Restructure dataframe preprocessing setup and change to avoid creatin… by @amholler in #1240
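
To illustrate the fallback str2bool mapping mentioned in #1471, here is a minimal sketch in plain Python. It is not Ludwig's actual implementation: the function name `build_str2bool`, the sets of boolean-like strings, and the rule of assigning labels by sorted order are all assumptions made for illustration.

```python
# Hypothetical sketch of a str2bool fallback; names and rules are assumptions,
# not Ludwig's real code.
TRUE_STRS = {"yes", "y", "true", "t", "1"}
FALSE_STRS = {"no", "n", "false", "f", "0"}


def build_str2bool(distinct_values):
    """Map each distinct value of a binary column to a bool."""
    keys = sorted({str(v) for v in distinct_values})
    normalized = {k: k.strip().lower() for k in keys}

    # Boolean-like values: use the conventional interpretation.
    if all(n in TRUE_STRS | FALSE_STRS for n in normalized.values()):
        return {k: normalized[k] in TRUE_STRS for k in keys}

    # Fallback: values are not boolean-like, so derive the mapping from the
    # column's distinct values. Assumed rule: first in sorted order is False.
    if len(keys) != 2:
        raise ValueError(f"expected exactly 2 distinct values, got {len(keys)}")
    return {keys[0]: False, keys[1]: True}


boolean_like = build_str2bool(["yes", "no"])
fallback = build_str2bool(["machine", "human"])
```

In this sketch a column with values such as `human` and `machine` still gets a deterministic binary encoding instead of failing; which value maps to `True` is a design choice, and the actual rule used in Ludwig may differ.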
Bug fixes
- Fixed Keras imports by @w4nderlust in #1215
- Fix assert in tabnet to be tf assert_rank by @w4nderlust in #1222
- Fixed read_csv for Dask by @tgaddair in #1247
- Fix TensorFlow CUDA version mismatch in Ray GPU image by @tgaddair in #1256
- Fix excluded field detection by @tgaddair in #1285
- Fixed automl to work when combiner is not specified by @tgaddair in #1293
- FIX: Issue 1181 resolves the ZeroDivisionError when calculating sample variance by @jimthompson5802 in #1326
- Fixed steps_per_epoch to be computed on batch resizing by @tgaddair in #1402
- Fix evaluation and visualization of confusion_matrix by @carlogrisetti in #1408
- Fixed auto eval batch size when train batch size is set by @tgaddair in #1410
- Fixed gpu isolation by @tgaddair in #1455
- Address issues in AutoML managing time-budget while exploring trial space by @amholler in #1535
- Fixed RayDatasets by @tgaddair in #1565
- Fix makedirs call to path_exists to pass url by @amholler in #1592
- Fixed KeyError while creating default config (#1643) by @tgaddair in #1654
- Fix FileNotFoundError while caching when cache_dir is … by @ShreyaR in #1665
- Fixed TabNet conversion to TF graph with unknown batch size by @tgaddair in #1252
Other changes and things to note
- Moved experiments to separate repo by @tgaddair in #1245
- Neuropod does not yet support python 3.9. Ludwig still supports neuropod for python<=3.8.
New Contributors
- @vreyespue made their first contribution in #1213
- @Yard1 made their first contribution in #1277
- @EnricoMi made their first contribution in #1442
Full Changelog: v0.4...v0.4.1