Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Fixed csv infer_schema on empty fields #1342

Conversation

tripokey
Copy link
Contributor

The default infer function provided by src/io/csv/utils.rs infers an empty field as DataType::Utf8. If the same column contains non empty fields containing valid DataType::Float64 data the column in infer_schema will then contain both DataTypes. When merge_schema is called it will decide that DataType::Utf8 has precedence over DataType::Float64 and set the column type to DataType::Float64.

If we instead decline to infer anything for a field without any data we will in the end get DataType::Float64 if the column contained other fields with valid f64 data and if we only had empty data for the column it will still default to DataType::Utf8.

@jorgecarleitao
Copy link
Owner

Thanks a lot for the PR! Could you add a test in tests/it/io/csv demoing this, so we get a regression test?

@codecov
Copy link

codecov bot commented Dec 30, 2022

Codecov Report

Base: 83.12% // Head: 83.29% // Increases project coverage by +0.16% 🎉

Coverage data is based on head (9cc8041) compared to base (e698d48).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1342      +/-   ##
==========================================
+ Coverage   83.12%   83.29%   +0.16%     
==========================================
  Files         370      371       +1     
  Lines       40196    40241      +45     
==========================================
+ Hits        33414    33519     +105     
+ Misses       6782     6722      -60     
Impacted Files Coverage Δ
src/io/csv/read/infer_schema.rs 97.43% <100.00%> (-2.57%) ⬇️
src/types/native.rs 66.41% <0.00%> (-1.04%) ⬇️
src/io/ipc/read/file.rs 96.42% <0.00%> (-0.45%) ⬇️
src/array/growable/structure.rs 96.34% <0.00%> (-0.22%) ⬇️
src/compute/hash.rs 100.00% <0.00%> (ø)
src/types/bit_chunk.rs 100.00% <0.00%> (ø)
src/compute/comparison/mod.rs 64.51% <0.00%> (ø)
src/compute/arithmetics/decimal/add.rs 96.70% <0.00%> (ø)
src/compute/arithmetics/decimal/div.rs 95.39% <0.00%> (ø)
... and 47 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@tripokey
Copy link
Contributor Author

Sure 🙂, I'm on vacation now but I will look into it when I'm back.

@tripokey
Copy link
Contributor Author

I added the test infer_ints_with_empty_fields() which verifies that type inference still works when there is an empty field present.

I apologize for the delay...

@jorgecarleitao jorgecarleitao changed the title fix csv infer_schema on empty fields Fixed csv infer_schema on empty fields Jan 16, 2023
@jorgecarleitao jorgecarleitao merged commit 218b7cf into jorgecarleitao:main Jan 16, 2023
@jorgecarleitao jorgecarleitao added the bug Something isn't working label Jan 16, 2023
@jorgecarleitao
Copy link
Owner

Thanks a lot, merged!

ritchie46 pushed a commit to ritchie46/arrow2 that referenced this pull request Mar 29, 2023
* fix csv infer_schema on empty fields

* Add test for csv::infer_schema with empty fields
ritchie46 pushed a commit to ritchie46/arrow2 that referenced this pull request Apr 5, 2023
* fix csv infer_schema on empty fields

* Add test for csv::infer_schema with empty fields
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants