This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved inference and deserialization of CSV #483

Merged 1 commit on Oct 2, 2021

Conversation

@jorgecarleitao (Owner) commented Oct 1, 2021

This PR:

  • Improves inference performance by removing the &[u8] -> utf8 conversion for types that do not need it (integer, boolean, float)
  • Adds support for inferring and deserializing timestamps with timezone
  • Adds support for deserializing all timestamp units
  • Removes two dependencies from CSV reading (lazy_static and regex), so that inference is 100% consistent with deserialization
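
To illustrate the first point, here is a minimal, std-only sketch of type inference that works directly on the raw bytes of a CSV field, with no UTF-8 validation step. The `Inferred` enum and the `infer`, `is_integer`, and `is_boolean` functions are hypothetical names made up for this example and are not the functions added by this PR; float inference is left out to keep the sketch short.

```rust
/// Hypothetical subset of the types a CSV inferer might distinguish.
#[derive(Debug, PartialEq)]
pub enum Inferred {
    Boolean,
    Int64,
    Utf8,
}

/// True if `bytes` is an optionally signed, non-empty run of ASCII digits.
/// This sketch does not check whether the value fits in 64 bits.
fn is_integer(bytes: &[u8]) -> bool {
    let digits = match bytes.first() {
        Some(b'-') | Some(b'+') => &bytes[1..],
        _ => bytes,
    };
    !digits.is_empty() && digits.iter().all(|b| b.is_ascii_digit())
}

/// True if `bytes` spells "true" or "false", ignoring ASCII case.
fn is_boolean(bytes: &[u8]) -> bool {
    bytes.eq_ignore_ascii_case(b"true") || bytes.eq_ignore_ascii_case(b"false")
}

/// Infers a type from the raw bytes of a field.
/// No `&[u8] -> &str` conversion happens on any of these paths, so even
/// invalid UTF-8 input is classified (as `Utf8`, the fallback) without panicking.
pub fn infer(bytes: &[u8]) -> Inferred {
    if is_boolean(bytes) {
        Inferred::Boolean
    } else if is_integer(bytes) {
        Inferred::Int64
    } else {
        Inferred::Utf8
    }
}
```

A deserializer that reuses the same byte-level checks would accept exactly the fields the inferer accepts, which is one way to read the "100% consistent" goal above.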

The deserialization and inference of timestamps with timezone assumes RFC3339. Users may extend the inferer and deserializer to handle other formats.
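
As a rough illustration of what "with timezone" means under RFC3339, the sketch below extracts only the trailing UTC offset (`Z`, `+HH:MM` or `-HH:MM`) from a string. It is std-only and deliberately partial: it does not validate the date and time portion, and `rfc3339_offset_seconds` is a hypothetical helper for this example, not code from this PR.

```rust
/// Returns the trailing RFC3339 offset of `value` in seconds east of UTC,
/// or `None` if the string does not end in "Z", "+HH:MM" or "-HH:MM".
/// Only the offset is inspected; the date/time part is not validated.
pub fn rfc3339_offset_seconds(value: &str) -> Option<i32> {
    // "Z" (or "z") denotes UTC, i.e. an offset of zero.
    if value.ends_with('Z') || value.ends_with('z') {
        return Some(0);
    }
    let bytes = value.as_bytes();
    if bytes.len() < 6 {
        return None;
    }
    // A numeric offset always occupies the last six bytes: sign, HH, ':', MM.
    let tail = &bytes[bytes.len() - 6..];
    let sign = match tail[0] {
        b'+' => 1,
        b'-' => -1,
        _ => return None,
    };
    if tail[3] != b':' {
        return None;
    }
    // Converts two ASCII digits to their numeric value.
    let two_digits = |a: u8, b: u8| -> Option<i32> {
        if a.is_ascii_digit() && b.is_ascii_digit() {
            Some(((a - b'0') * 10 + (b - b'0')) as i32)
        } else {
            None
        }
    };
    let hours = two_digits(tail[1], tail[2])?;
    let minutes = two_digits(tail[4], tail[5])?;
    if hours > 23 || minutes > 59 {
        return None;
    }
    Some(sign * (hours * 3600 + minutes * 60))
}
```

A field for which this returns `None` carries no recognizable offset, so under the assumption above it would not be treated as a timestamp with timezone.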

This also fixes a bug in which datetimes were being parsed as Date64, even though a Date64 value is expected to be a multiple of 86400000 (the number of milliseconds in a day) and not an arbitrary number. They are now inferred and parsed as Timestamp(Millisecond, None).
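
The Date64 constraint can be checked with simple arithmetic. The sketch below is illustrative only; `MillisKind`, `is_whole_day`, and `classify` are hypothetical names and not part of this PR. It shows why a datetime with a time-of-day component does not fit Date64.

```rust
/// Number of milliseconds in one day (24 * 60 * 60 * 1000).
const MILLISECONDS_PER_DAY: i64 = 86_400_000;

/// Hypothetical stand-ins for the two logical types discussed above.
#[derive(Debug, PartialEq)]
pub enum MillisKind {
    Date64,
    TimestampMillisecond,
}

/// True if `millis` (milliseconds since the Unix epoch) falls exactly on a day boundary.
pub fn is_whole_day(millis: i64) -> bool {
    millis % MILLISECONDS_PER_DAY == 0
}

/// Picks the type whose invariant the value satisfies.
pub fn classify(millis: i64) -> MillisKind {
    if is_whole_day(millis) {
        MillisKind::Date64
    } else {
        MillisKind::TimestampMillisecond
    }
}
```

For example, 1_633_046_400_000 (2021-10-01T00:00:00 UTC) is a multiple of 86400000, while 1_633_089_600_000 (2021-10-01T12:00:00 UTC) is not, so only the former is a valid Date64 value.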

@jorgecarleitao added the "enhancement" (An improvement to an existing feature) label Oct 1, 2021
@jorgecarleitao changed the title from "Improved inference of CSV" to "Improved inference and deserialization of CSV" Oct 1, 2021
codecov bot commented Oct 1, 2021

Codecov Report

Merging #483 (8a90ece) into main (c098494) will decrease coverage by 0.02%.
The diff coverage is 78.78%.

@@            Coverage Diff             @@
##             main     #483      +/-   ##
==========================================
- Coverage   80.28%   80.25%   -0.03%     
==========================================
  Files         374      374              
  Lines       22949    23753     +804     
==========================================
+ Hits        18424    19063     +639     
- Misses       4525     4690     +165     
Impacted Files Coverage Δ
src/io/csv/read/reader.rs 47.36% <ø> (-15.79%) ⬇️
src/io/csv/read/deserialize.rs 63.86% <59.25%> (+6.43%) ⬆️
src/io/csv/read/infer_schema.rs 85.71% <81.48%> (+0.42%) ⬆️
tests/it/io/csv/read.rs 100.00% <100.00%> (ø)
src/compute/cast/binary_to.rs 71.73% <0.00%> (-4.13%) ⬇️
src/compute/take/mod.rs 74.57% <0.00%> (-1.19%) ⬇️
src/io/avro/read/schema.rs 41.37% <0.00%> (-1.15%) ⬇️
src/temporal_conversions.rs 74.54% <0.00%> (-0.81%) ⬇️
src/compute/comparison/simd/mod.rs 95.12% <0.00%> (-0.72%) ⬇️
src/types/simd/native.rs 90.32% <0.00%> (-0.59%) ⬇️
... and 22 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao merged commit 2d84c2d into main Oct 2, 2021
@jorgecarleitao deleted the improve_csv branch October 2, 2021 07:34