Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement as_polars_series() for some clock package types #861

Merged
merged 11 commits into from
Mar 1, 2024

Conversation

eitsupi
Copy link
Collaborator

@eitsupi eitsupi commented Feb 28, 2024

Part of #591

It looks like it could be implemented on the R side only.
There is room for improvement in performance because of the use of when-then-otherwise to handle overflow when converting UInt64 to Int64, but I think it is a good starting point.

char_times = c(
  "1900-01-01 01:01:01.123456789",
  "2212-01-01 12:34:57.123456789",
  "3712-01-01 12:34:56.123456789"
)
fmt = "%Y-%m-%d %H:%M:%OS"
clock_times = list(
  ns = clock::naive_time_parse(char_times , format = fmt, precision = "nanosecond"),
  us =  clock::naive_time_parse(char_times , format = fmt, precision = "microsecond"),
  ms =  clock::naive_time_parse(char_times , format = fmt, precision = "millisecond"),
  s = clock::naive_time_parse(char_times , format = fmt, precision = "second"),
  d = clock::naive_time_parse(char_times , format = fmt, precision = "day")
)

clock_times
#> $ns
#> <naive_time<nanosecond>[3]>
#> [1] "1900-01-01T01:01:01.123456789" "2212-01-01T12:34:57.123456789"
#> [3] "1958-05-04T13:51:14.994801941"
#>
#> $us
#> <naive_time<microsecond>[3]>
#> [1] "1900-01-01T01:01:01.123456" "2212-01-01T12:34:57.123456"
#> [3] "3712-01-01T12:34:56.123456"
#>
#> $ms
#> <naive_time<millisecond>[3]>
#> [1] "1900-01-01T01:01:01.123" "2212-01-01T12:34:57.123"
#> [3] "3712-01-01T12:34:56.123"
#>
#> $s
#> <naive_time<second>[3]>
#> [1] "1900-01-01T01:01:01" "2212-01-01T12:34:57" "3712-01-01T12:34:56"
#>
#> $d
#> <naive_time<day>[3]>
#> [1] "1900-01-01" "2212-01-02" "3712-01-02"

lapply(clock_times, polars::as_polars_series)
#> $ns
#> polars Series: shape: (3,)
#> Series: '' [datetime[ns]]
#> [
#>  1900-01-01 01:01:01.123456789
#>  2212-01-01 12:34:57.123456789
#>  1958-05-04 13:51:14.994801941
#> ]
#>
#> $us
#> polars Series: shape: (3,)
#> Series: '' [datetime[μs]]
#> [
#>  1900-01-01 01:01:01.123456
#>  2212-01-01 12:34:57.123456
#>  3712-01-01 12:34:56.123456
#> ]
#>
#> $ms
#> polars Series: shape: (3,)
#> Series: '' [datetime[ms]]
#> [
#>  1900-01-01 01:01:01.123
#>  2212-01-01 12:34:57.123
#>  3712-01-01 12:34:56.123
#> ]
#>
#> $s
#> polars Series: shape: (3,)
#> Series: '' [datetime[ms]]
#> [
#>  1900-01-01 01:01:01
#>  2212-01-01 12:34:57
#>  3712-01-01 12:34:56
#> ]
#>
#> $d
#> polars Series: shape: (3,)
#> Series: '' [datetime[ms]]
#> [
#>  1900-01-01 00:00:00
#>  2212-01-02 00:00:00
#>  3712-01-02 00:00:00
#> ]

Created on 2024-02-28 with reprex v2.0.2

@eitsupi eitsupi requested review from sorhawell and removed request for sorhawell February 28, 2024 14:32
@eitsupi eitsupi changed the title feat: implement as_polars_series(<clock_time_point>) feat: implement as_polars_series() for some clock package types Feb 28, 2024
@eitsupi
Copy link
Collaborator Author

eitsupi commented Feb 28, 2024

If this is less performant than string parsing, it is not worth the complexity of the implementation so I will do some benchmarking later.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Feb 29, 2024

In my environment, this implementation seems to be faster for more than 100,000 rows.

library(clock)
library(polars)

time_clock <- seq_len(10^5) |>
  as.POSIXct(tz = "UTC") |>
  as_zoned_time()

bench::mark(
  via_str = {
    time_clock |>
      as.character() |>
      (\(x) as_polars_series(x)$str$
        replace(r"(\[.*\])", "")$str$
        strptime(
        pl$Datetime("ms"), "%Y-%m-%dT%H:%M:%S%:z"
      ))()
  },
  via_double = {
    as_polars_series(time_clock)
  },
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 via_str       712ms    712ms      1.40     6.9MB     1.40
#> 2 via_double    127ms    145ms      7.01    5.98MB     1.75

Created on 2024-02-29 with reprex v2.0.2

Copy link
Collaborator

@etiennebacher etiennebacher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really no expert on datetime handling so I don't have much to say about this, but LGTM

NEWS.md Show resolved Hide resolved
@etiennebacher
Copy link
Collaborator

Just saw the conversation in the linked issue to the clock repo, do you want to use strings instead?

@eitsupi
Copy link
Collaborator Author

eitsupi commented Mar 1, 2024

Just saw the conversation in the linked issue to the clock repo, do you want to use strings instead?

I think we can leave it as is for now and replace it with an implementation via string when the clock is updated because of the performance benefits as commented above.

@eitsupi eitsupi merged commit f859b60 into main Mar 1, 2024
18 checks passed
@eitsupi eitsupi deleted the as_polars_series-clock branch March 1, 2024 04:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants