Fixes dtype of SPLIT column if already provided in CSV #2140

geoffreyangus · 2022-06-14T19:18:59Z

This PR is a follow-up to #2058, which removed the assumption that CSV columns would be loaded with dtypes inferred by Dask/Pandas. Instead, all columns are now initially loaded with dtype object, with the expectation that these are casted to their appropriate dtype downstream. The SPLIT column is therefore not read in with dtype int. This leads to the following error when saving out the SPLIT column:

ludwig/api.py:1276: in preprocess
    preprocessed_data = preprocess_for_training(
ludwig/data/preprocessing.py:1577: in preprocess_for_training
    processed = data_format_processor.preprocess_for_training(
ludwig/data/preprocessing.py:275: in preprocess_for_training
    return _preprocess_file_for_training(
ludwig/data/preprocessing.py:1674: in _preprocess_file_for_training
    save_array(split_fp, data[SPLIT])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data_fp = '/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn/T/tmpdjpna6ui/23500A2F7A.split.csv'
array = Dask Series Structure:
npartitions=1
    object
       ...
Name: split, dtype: object
Dask Name: getitem, 21 tasks

    def save_array(data_fp, array):
        with open_file(data_fp, "w") as output_file:
>           for x in np.nditer(array):
E           TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled

ludwig/utils/data_utils.py:411: TypeError

This PR ensures that the SPLIT column provided is given the same dtype as it would be if the split was randomly generated. This PR also includes a regression test.

github-actions · 2022-06-14T19:52:29Z

Unit Test Results

      4 files ±0       4 suites ±0 1h 18m 35s ⏱️ - 2m 32s
2 825 tests +2 2 791 ✔️ +2 34 💤 ±0 0 ❌ ±0
5 650 runs +4 5 579 ✔️ +4 71 💤 ±0 0 ❌ ±0

Results for commit 28ddd3f. ± Comparison against base commit 520af82.

Fixes dtype of SPLIT column if already provided

28ddd3f

geoffreyangus requested a review from justinxzhao June 14, 2022 19:18

justinxzhao approved these changes Jun 14, 2022

View reviewed changes

tgaddair merged commit 59dcd19 into master Jun 14, 2022

tgaddair deleted the fix-split-dtype branch June 14, 2022 23:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes dtype of SPLIT column if already provided in CSV #2140

Fixes dtype of SPLIT column if already provided in CSV #2140

geoffreyangus commented Jun 14, 2022 •

edited

Loading

github-actions bot commented Jun 14, 2022

Fixes dtype of SPLIT column if already provided in CSV #2140

Fixes dtype of SPLIT column if already provided in CSV #2140

Conversation

geoffreyangus commented Jun 14, 2022 • edited Loading

github-actions bot commented Jun 14, 2022

Unit Test Results

geoffreyangus commented Jun 14, 2022 •

edited

Loading