Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes dtype of SPLIT column if already provided in CSV #2140

Merged
merged 1 commit into from
Jun 14, 2022

Conversation

geoffreyangus
Copy link
Collaborator

@geoffreyangus geoffreyangus commented Jun 14, 2022

This PR is a follow-up to #2058, which removed the assumption that CSV columns would be loaded with dtypes inferred by Dask/Pandas. Instead, all columns are now initially loaded with dtype object, with the expectation that these are casted to their appropriate dtype downstream. The SPLIT column is therefore not read in with dtype int. This leads to the following error when saving out the SPLIT column:

ludwig/api.py:1276: in preprocess
    preprocessed_data = preprocess_for_training(
ludwig/data/preprocessing.py:1577: in preprocess_for_training
    processed = data_format_processor.preprocess_for_training(
ludwig/data/preprocessing.py:275: in preprocess_for_training
    return _preprocess_file_for_training(
ludwig/data/preprocessing.py:1674: in _preprocess_file_for_training
    save_array(split_fp, data[SPLIT])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data_fp = '/var/folders/23/8ddc07cj3knbkldhnfv2cmbr0000gn/T/tmpdjpna6ui/23500A2F7A.split.csv'
array = Dask Series Structure:
npartitions=1
    object
       ...
Name: split, dtype: object
Dask Name: getitem, 21 tasks

    def save_array(data_fp, array):
        with open_file(data_fp, "w") as output_file:
>           for x in np.nditer(array):
E           TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled

ludwig/utils/data_utils.py:411: TypeError

This PR ensures that the SPLIT column provided is given the same dtype as it would be if the split was randomly generated. This PR also includes a regression test.

@github-actions
Copy link

Unit Test Results

       4 files  ±0         4 suites  ±0   1h 18m 35s ⏱️ - 2m 32s
2 825 tests +2  2 791 ✔️ +2  34 💤 ±0  0 ±0 
5 650 runs  +4  5 579 ✔️ +4  71 💤 ±0  0 ±0 

Results for commit 28ddd3f. ± Comparison against base commit 520af82.

@tgaddair tgaddair merged commit 59dcd19 into master Jun 14, 2022
@tgaddair tgaddair deleted the fix-split-dtype branch June 14, 2022 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants