Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add String datatype to stage2 #68

Merged
merged 5 commits into from Nov 21, 2022
Merged

Conversation

ellmau
Copy link
Member

@ellmau ellmau commented Nov 21, 2022

Adding String as a variant to DataTypeName and follow up all the necessary adaptions to promote the new datatype to all the implementations.

To avoid datatype conversions and additional checks, the new datatype uses usize internally to have the same datatype and range as it is provided by the currently implemented dictionaries.

TODO: add a printer-object such that the internal numbers can be translated back to the original String. (This will probably be a new issue after this PR is accepted)
implements #63

@ellmau ellmau added enhancement New feature or request technical-debt Refactoring for an implementation that was good enough at the time labels Nov 21, 2022
@ellmau ellmau requested a review from mmarx November 21, 2022 09:58
@ellmau ellmau self-assigned this Nov 21, 2022
@ellmau ellmau linked an issue Nov 21, 2022 that may be closed by this pull request
Comment on lines +404 to +422
Self::Item::Float(_) => None,
Self::Item::Double(_) => None,
Self::Item::String(_) => None,
},
Self::Float(cs) => match value {
Self::Item::U64(_val) => None,
Self::Item::U64(_) => None,
Self::Item::Float(val) => cs.seek(val).map(DataValueT::Float),
Self::Item::Double(_val) => None,
Self::Item::Double(_) => None,
Self::Item::String(_) => None,
},
Self::Double(cs) => match value {
Self::Item::U64(_val) => None,
Self::Item::Float(_val) => None,
Self::Item::U64(_) => None,
Self::Item::Float(_) => None,
Self::Item::Double(val) => cs.seek(val).map(DataValueT::Double),
Self::Item::String(_) => None,
},
Self::String(cs) => match value {
Self::Item::String(val) => cs.seek(val).map(DataValueT::String),
_ => None, // no type mixing allowed, so in any other case it should be [None]
Copy link
Member Author

@ellmau ellmau Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing _val to _ the memory footprint is reduced, as _val needs to stay on the stack while still in the current scope. The memory of _ is freed instantly.

src/physical/dictionary.rs Outdated Show resolved Hide resolved
@@ -58,6 +58,7 @@ impl<'a> TrieUnion<'a> {
DataTypeName::U64 => init_scans_for_datatype!(U64, u64),
DataTypeName::Float => init_scans_for_datatype!(Float, Float),
DataTypeName::Double => init_scans_for_datatype!(Double, Double),
DataTypeName::String => init_scans_for_datatype!(String, usize),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be helpful if each DataTypeName knows its internal interpretation, so this information can be extracted "automatically" if it is needed more often.

@mmarx
Copy link
Member

mmarx commented Nov 21, 2022

To avoid datatype conversions and additional checks, the new datatype uses usize internally to have the same datatype and range as it is provided by the currently implemented dictionaries.

I'm not too happy with this. On 64-bit architectures, usize and u64 have the same representation anyways, but on 32-bit architectures, usize is only 32 bits wide, so now the size of the variant depends on the architecture.

@ellmau
Copy link
Member Author

ellmau commented Nov 21, 2022

To avoid datatype conversions and additional checks, the new datatype uses usize internally to have the same datatype and range as it is provided by the currently implemented dictionaries.

I'm not too happy with this. On 64-bit architectures, usize and u64 have the same representation anyways, but on 32-bit architectures, usize is only 32 bits wide, so now the size of the variant depends on the architecture.

I see your concern. It is possible to refactor the Dictionary trait to be either u64 or u32.
My thoughts have been to utilize the native number representation of the machine.

src/io/csv.rs Outdated Show resolved Hide resolved
src/physical/datatypes/data_value.rs Outdated Show resolved Hide resolved
src/physical/datatypes/data_value.rs Outdated Show resolved Hide resolved
src/physical/dictionary.rs Outdated Show resolved Hide resolved
@mmarx
Copy link
Member

mmarx commented Nov 21, 2022

I see your concern. It is possible to refactor the Dictionary trait to be either u64 or u32.
My thoughts have been to utilize the native number representation of the machine.

usize is not about being the native representation, its sole point is to cover the whole range of allowable pointer values. Even i386 has 64-bit-registers, so u64 could also be considered “native” there.

@mmarx
Copy link
Member

mmarx commented Nov 21, 2022

Apart from this discussion on usize/u64 (which we should probably decide on in the group) and the small changes above, this looks good, though.

@ellmau
Copy link
Member Author

ellmau commented Nov 21, 2022

Apart from this discussion on usize/u64 (which we should probably decide on in the group) and the small changes above, this looks good, though.

Thank you for the thoughtful review, did all the changes except the usize vs u32/u64 issue

Copy link
Member

@mmarx mmarx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two small typos, otherwise fine :)

src/io/csv.rs Outdated Show resolved Hide resolved
src/physical/datatypes/data_value.rs Outdated Show resolved Hide resolved
@ellmau
Copy link
Member Author

ellmau commented Nov 21, 2022

Apart from this discussion on usize/u64 (which we should probably decide on in the group) and the small changes above, this looks good, though.

Thank you for the thoughtful review, did all the changes except the usize vs u32/u64 issue

I see your concern. It is possible to refactor the Dictionary trait to be either u64 or u32.
My thoughts have been to utilize the native number representation of the machine.

usize is not about being the native representation, its sole point is to cover the whole range of allowable pointer values. Even i386 has 64-bit-registers, so u64 could also be considered “native” there.

It is still something to consider when using indices, as indices are always usize. I wanted to avoid unnecessary casting back and forth when I've designed the Dictionary treat.

@ellmau ellmau merged commit 910d36c into main Nov 21, 2022
mmarx added a commit that referenced this pull request Nov 24, 2022
This reverts commit 910d36c.

As per discussion (#70), revert the string type.
@mmarx mmarx deleted the feature/63-add-string-datatype-handling branch April 26, 2023 11:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request technical-debt Refactoring for an implementation that was good enough at the time
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Setting Datatypes for Reading CSV
2 participants