This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Extract parts of datetime #433

Merged
merged 17 commits into from Sep 26, 2021

Contributor

@VasanthakumarV VasanthakumarV commented Sep 21, 2021

This PR extends the datetime extract methods to support:

  • year (previously supported)
  • month
  • day
  • weekday
  • iso_week
  • hour (previously supported)
  • minute
  • second
  • nanosecond

to Date32, Date64, Time32(_) and Time64(_), Timestamp(_, None) and Timestamp(_, Some(_)), thereby offering broad support to use dates, times and timestamps. For timestamps with timezone, these methods do take leap days (Feb 29th) and hours (summer time) into account.

Closes #415

@VasanthakumarV VasanthakumarV marked this pull request as draft September 21, 2021 09:42
Contributor Author

@VasanthakumarV VasanthakumarV left a comment

@jorgecarleitao I have only added month and day so far; I wanted to get your feedback before proceeding further.

The idea is to combine the Datelike components under one macro_rules! macro and the Timelike components under another.

@@ -246,63 +228,90 @@ pub fn can_hour(data_type: &DataType) -> bool {
)
}

macro_rules! date_like {
($component:ident, $array:ident, $data_type:path, $chrono_tz:ident) => {
Contributor Author

Having four arguments on a macro_rules! macro is definitely not very readable; I will try to bring this number down.

Owner

Would it be possible to convert this to a generic with parameters A, D: Datelike, F: Fn(D) -> A, and pass |x| x.year(), so that we do not rely on macros?
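
A minimal sketch of the generic shape suggested in this comment, using only the standard library. `SimpleTime` and `extract` are hypothetical names chosen for illustration; they are not part of arrow2 or chrono, and the real kernel would convert to chrono types rather than to this stand-in struct.

```rust
/// Hypothetical stand-in for a chrono time type (not the real `NaiveTime`).
#[derive(Clone, Copy)]
struct SimpleTime {
    /// Seconds since midnight.
    secs: u32,
}

impl SimpleTime {
    fn hour(&self) -> u32 {
        self.secs / 3600
    }
    fn minute(&self) -> u32 {
        (self.secs / 60) % 60
    }
    fn second(&self) -> u32 {
        self.secs % 60
    }
}

/// Generic kernel: `to_d` converts a physical value `T` into a
/// date/time-like value `D`, and `op` extracts one component `A` from it.
/// Nulls (`None`) are propagated unchanged.
fn extract<T, D, A, C, F>(values: &[Option<T>], to_d: C, op: F) -> Vec<Option<A>>
where
    T: Copy,
    C: Fn(T) -> D,
    F: Fn(D) -> A,
{
    values
        .iter()
        .copied()
        .map(|v| v.map(|x| op(to_d(x))))
        .collect()
}
```

With this shape, the caller picks the component with a closure such as `|t: SimpleTime| t.hour()`, and the trait bounds on `C` and `F` document exactly what the kernel needs. The sketch does not address the harder part raised later in the thread, where different data types map to different `D` types at run time.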

@VasanthakumarV VasanthakumarV changed the title from "[WIP] Extract parts of Date other than hour and year" to "[WIP] Extract parts of Datetime other than hour and year" Sep 21, 2021
@codecov

codecov bot commented Sep 21, 2021

Codecov Report

Merging #433 (9a85313) into main (7dedd02) will decrease coverage by 0.77%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #433      +/-   ##
==========================================
- Coverage   80.78%   80.00%   -0.78%     
==========================================
  Files         372      371       -1     
  Lines       22651    22833     +182     
==========================================
- Hits        18299    18268      -31     
- Misses       4352     4565     +213     
Impacted Files                          Coverage Δ
src/compute/temporal.rs                 94.39% <100.00%> (+8.41%) ⬆️
tests/it/compute/temporal.rs            100.00% <100.00%> (ø)
src/alloc/mod.rs                        0.00% <0.00%> (-93.03%) ⬇️
src/array/dictionary/mod.rs             46.66% <0.00%> (-12.16%) ⬇️
src/compute/arithmetics/basic/div.rs    81.66% <0.00%> (-10.65%) ⬇️
src/types/mod.rs                        22.22% <0.00%> (-6.10%) ⬇️
src/array/boolean/ffi.rs                0.00% <0.00%> (-5.89%) ⬇️
src/array/union/mod.rs                  75.22% <0.00%> (-5.78%) ⬇️
src/array/dictionary/mutable.rs         76.66% <0.00%> (-5.48%) ⬇️
src/array/list/mutable.rs               70.96% <0.00%> (-4.90%) ⬇️
... and 62 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7dedd02...9a85313.

Owner

@jorgecarleitao jorgecarleitao left a comment

Looks really good!

I am not sure about my comment, but I think it is worth trying to convert the macro to a generic with a function that takes components out of the date-like object, though it may not be possible.

In general, generics are more expressive and easier to read because they pinpoint the exact constraints (traits) required to uphold the functionality.

@VasanthakumarV
Contributor Author

VasanthakumarV commented Sep 22, 2021

@jorgecarleitao I wasn't quite able to reach A, D: Datelike, F: Fn(D) -> A; I had to settle for A, F: Fn(NaiveDateTime) -> A. For types that don't fit this pattern (DataType::Timestamp(time_unit, Some(timezone_str)) requires Fn(DateTime<FixedOffset>) and Fn(DateTime<Tz>)), I had to repeat code across year, month and day.

I wasn't able to make trait bounds work on closure arguments; I tried D: Datelike, fn(D) -> A once, but failed.

I will also have to fix the chrono_tz feature-gating issue.

@VasanthakumarV
Contributor Author

Let me investigate more and reduce the boilerplate.

@VasanthakumarV
Contributor Author

@jorgecarleitao Do you think Fn(&dyn Datelike) -> O is a good idea?

Since at compile time we won't know whether the input will be NaiveTime, NaiveDateTime, DateTime<Tz> or DateTime<FixedOffset>, we might have to resort to dynamic dispatch.

This could also be costly, because it is applied to each element of the array, but I am not sure.


Currently, we create one closure for each type with the help of match statements in each of the hour, minute, year, ... functions.

pub fn hour(array: &dyn Array) -> Result<PrimitiveArray<u32>> {
    match array.data_type() {
        DataType::Date32 | DataType::Date64 | &DataType::Timestamp(_, None) => {
            date_like(array, DataType::UInt32, |x| x.hour())
        }
        DataType::Time32(_) | DataType::Time64(_) => {
            time_like(array, DataType::UInt32, |x| x.hour())
        }
        DataType::Timestamp(time_unit, Some(timezone_str)) => {
            ...
            if let Ok(timezone) = timezone {
                Ok(extract_impl(array, time_unit, timezone, |x| x.hour()))
            } else {
                chrono_tz(array, time_unit, timezone_str, |x| x.hour())
            }
        }
    }
}

@jorgecarleitao
Owner

Wow, really good changes.

I think it is worth trying the dyn: I would create a small benchmark and see the impact. I suspect if we #[inline] the function, it won't have any impact, but I may be wrong ^_^

@VasanthakumarV
Contributor Author

@jorgecarleitao Unfortunately, the Datelike and Timelike traits are not object safe, so dyn is not possible 😞.
We could create a new trait with the subset of methods we want, but implementing it on the four types may be too much.


I had to go back to macro_rules, but this time around it is minimal.
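
For context on the object-safety point above, here is a standard-library-only sketch. `FullDate`, `DateParts`, `Ymd` and `year_of` are hypothetical names, not chrono's real definitions; the sketch only assumes that a trait with a `Sized` supertrait or `Self`-returning methods cannot be used as `dyn Trait`, and shows the "new trait with a subset of methods" workaround mentioned in the comment.

```rust
/// Hypothetical trait shaped like a "full" date API. The `Sized` supertrait
/// and the `Self`-returning method both rule out `dyn FullDate`.
trait FullDate: Sized {
    fn year(&self) -> i32;
    fn with_year(&self, year: i32) -> Option<Self>;
}

/// Hypothetical object-safe subset: read-only accessors only.
trait DateParts {
    fn year(&self) -> i32;
}

/// Blanket impl: every `FullDate` type automatically provides `DateParts`,
/// so the subset does not have to be implemented type by type.
impl<D: FullDate> DateParts for D {
    fn year(&self) -> i32 {
        FullDate::year(self)
    }
}

/// Hypothetical concrete date type used only for this illustration.
struct Ymd {
    year: i32,
    month: u32,
    day: u32,
}

impl FullDate for Ymd {
    fn year(&self) -> i32 {
        self.year
    }
    fn with_year(&self, year: i32) -> Option<Self> {
        Some(Ymd { year, month: self.month, day: self.day })
    }
}

/// Works through dynamic dispatch because `DateParts` is object safe.
fn year_of(d: &dyn DateParts) -> i32 {
    d.year()
}
```

Whether a blanket impl like this would be practical for the four chrono types in question is not established in this thread, and the per-element cost of dynamic dispatch raised earlier would still need a benchmark.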

Owner

@jorgecarleitao jorgecarleitao left a comment

You clearly enjoy writing simple, readable code in Rust =) I went through it and it is very pleasant to read it. I left minor suggestions, but feel free to iterate on it until you are comfortable.

One thing missing is tests. We do not need to cover all variations, since we are using macros to avoid duplicating code. IMO, having tests for timestamps with and without a timezone is important, in particular:

  1. extract the hour from a timestamp during summer time. See here for an example where there is a summer time shift.
  2. extract the day of a leap day (29th of Feb). See here for an example of one.
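
The leap-day case in item 2 can be made concrete with a standard-library-only sketch. The function name `civil_from_days` is hypothetical and this is not the code in the PR (which relies on chrono); it applies the widely used days-to-civil-date arithmetic to a Date32 value, i.e. a count of days since 1970-01-01.

```rust
/// Convert days since 1970-01-01 (the physical value of a Date32) into
/// (year, month, day) in the proleptic Gregorian calendar.
fn civil_from_days(days: i32) -> (i32, u32, u32) {
    // Shift the epoch to 0000-03-01 so that leap days fall at the end of a year.
    let z = i64::from(days) + 719_468;
    let era = z.div_euclid(146_097); // number of 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, in [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // [0, 399]
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // March-based day of year
    let mp = (5 * doy + 2) / 153; // March-based month, in [0, 11]
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the next civil year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year as i32, month, day)
}
```

A test along these lines would feed day 18321 (2020-02-29) and the days around it, and check that the extracted day and month roll over correctly in leap and non-leap years.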

A note: formally second(date64) should be 0. It is a bit misleading in the arrow specification, but date64 is a multiple of 86400000 (see here) 🤯. IMO we should enforce it on the array construction, so it is fine to keep it like this here.
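
The arithmetic behind this note, as a small sketch with hypothetical helper names (`second_of_date64`, `is_whole_day`) that are not part of the PR: if a Date64 value is a whole number of days in milliseconds, its second component is necessarily 0, whereas an unvalidated value can yield a non-zero second.

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Second-of-minute of a Date64 value (milliseconds since the Unix epoch),
/// computed without validating the input.
fn second_of_date64(ms: i64) -> u32 {
    let ms_in_day = ms.rem_euclid(MILLIS_PER_DAY);
    ((ms_in_day / 1_000) % 60) as u32
}

/// Whether `ms` represents a whole number of days.
fn is_whole_day(ms: i64) -> bool {
    ms % MILLIS_PER_DAY == 0
}
```

Enforcing `is_whole_day` at array construction, as suggested above, would make the second, minute and hour of any Date64 trivially zero.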

Contributor Author

@VasanthakumarV VasanthakumarV left a comment

Thanks for the review, @jorgecarleitao.

The nanosecond method hasn't been properly tested yet.

Do you think we should add or remove any of the time extraction methods?

@jorgecarleitao
Owner

Let me know when you think it is ready (since it is still marked as a draft)

@VasanthakumarV
Contributor Author

VasanthakumarV commented Sep 26, 2021

Let me know when you think it is ready (since it is still marked as a draft)

@jorgecarleitao I was just trying to add a test that uses TimeUnit::Nanosecond. I wanted to check today's date represented in nanoseconds (163265170202000000000), but Int64Array is not enough and Int128Array does not support the Timestamp logical type, so I wasn't sure what to do.

I will try for Time64 with TimeUnit::Nanosecond
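
A quick magnitude check on the nanosecond value, as a sketch. The helper name `secs_to_nanos` and the example Unix time 1632651702 (a time on 26 September 2021, read off the leading digits of the number quoted above) are assumptions for illustration. A timestamp of that date in nanoseconds has 19 digits and fits in an `i64`; it is the 21-digit literal quoted above that does not.

```rust
/// Nanoseconds since the Unix epoch for a timestamp given in whole seconds,
/// or `None` if the result does not fit in an `i64`.
fn secs_to_nanos(secs: i64) -> Option<i64> {
    secs.checked_mul(1_000_000_000)
}
```

Under that assumption, an `i64`-backed timestamp array with TimeUnit::Nanosecond can represent dates in 2021, so a nanosecond timestamp test would not need a wider integer type.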

@VasanthakumarV VasanthakumarV marked this pull request as ready for review September 26, 2021 11:00
@VasanthakumarV VasanthakumarV changed the title from "[WIP] Extract parts of Datetime other than hour and year" to "Extract parts of Datetime other than hour and year" Sep 26, 2021
@jorgecarleitao jorgecarleitao added the feature A new feature label Sep 26, 2021
@jorgecarleitao jorgecarleitao changed the title from "Extract parts of Datetime other than hour and year" to "Extract parts of datetime" Sep 26, 2021
@jorgecarleitao jorgecarleitao merged commit 90887c2 into jorgecarleitao:main Sep 26, 2021
@jorgecarleitao
Owner

Thank you again for this amazing work and PR 🙇 . I have modified the title and description to cover the great work done here so that anyone looking through the changelog can track it to this PR.

Labels
feature A new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support to extract other parts of a date
2 participants