New implementations of times.parse & times.format #8094

Merged
merged 2 commits into from
Jul 9, 2018

Member

@GULPF GULPF commented Jun 23, 2018

This PR contains new improved implementations of times.parse and times.format.

  • More powerful mini-language
  • Better performance
  • Uses static[string] to validate layout at compile time when possible
  • More consistent error handling, should now always raise a ValueError when something goes wrong
  • Several minor bugs/limitations fixed

Changes in the layout mini-language

  • All z* patterns now output 'Z' for UTC
  • Added g pattern for era (AD or BC)
  • Added zzzz pattern for UTC offset including seconds
  • Added uuuu pattern for astronomical year padded to four digits
  • Added UUUU pattern for astronomical year without padding
  • Added YYYY pattern for year without padding
  • Deprecated y, yyy and yyyyy. These patterns are not useful and complicate the mini-language for no good reason.

The uuuu/yyyy patterns now prepend a '+' when the number of digits in the year is more than four (unless it's uuuu and the year is negative). This way, the iso format yyyyMMdd works as long as the year is in the range 1..9999. This behavior is consistent with Java (and maybe other languages as well).

Non-patterns that aren't separators must now always be surrounded by '. This was the documented behavior before as well, but the old implementation allowed non-quoted text anyway.
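As a rough illustration of the mini-language changes listed above, the following sketch formats and parses a fixed UTC timestamp. It assumes a Nim version that includes this PR; the chosen timestamp and the variable names are only for illustration.

```nim
import times

# A fixed UTC timestamp, so the result does not depend on the local zone.
let dt = initDateTime(1, mJan, 2000, 12, 0, 0, utc())

# All z* patterns print 'Z' for UTC.
echo dt.format("yyyy-MM-dd'T'HH:mm:sszzz")   # 2000-01-01T12:00:00Z

# The g pattern prints the era.
echo dt.format("yyyy-MM-dd g")               # 2000-01-01 AD

# Text that is neither a pattern nor a separator must be quoted with '.
echo dt.format("yyyy-MM-dd 'at' HH:mm")      # 2000-01-01 at 12:00

# parse accepts the same format string.
let parsed = parse("2000-01-01T12:00:00Z", "yyyy-MM-dd'T'HH:mm:sszzz", utc())
doAssert parsed == dt
```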

Benchmark

The new implementations perform significantly better, especially parse. Naive benchmark.

Result:

  Before:
    parse(x, "yyyy-MM-dd'T'HH:mm:sszzz", utc())     303 ms
    parse(x, "yyyy-MM-dd'T'HH:mm:sszzz", local())   394 ms
    format(x, "yyyy-MM-dd'T'HH:mm:sszzz")           147 ms

  After:
    parse(x, "yyyy-MM-dd'T'HH:mm:sszzz", utc())     130 ms
    parse(x, "yyyy-MM-dd'T'HH:mm:sszzz", local())   224 ms
    format(x, "yyyy-MM-dd'T'HH:mm:sszzz")           112 ms

Fixes #7017
Fixes #7189

@GULPF
Member Author

GULPF commented Jun 23, 2018

I forgot that the code must survive bootstrapping, so there is still some work remaining

@GULPF GULPF force-pushed the new-parse-format branch 4 times, most recently from a5d62bb to 639814e Compare June 23, 2018 21:54
@jyapayne
Contributor

This is awesome! I love the new mini-language!

Contributor

@Varriount Varriount left a comment

Very impressive work! This is awesome! I feel like the times module is shaping up to be a real gem.

General Suggestions:

  • Prudent use of toOpenArray might improve performance even more. That being said, there might be term-rewriting macros in the future that will do this automatically.
  • Personally, I would break up some of the parsing/formatting routines into multiple sub-procedures (parsePattern in particular). Thoughts?
  • I wonder if some of the internal type names are too generic (Token, etc.).

@@ -7,35 +7,127 @@
# distribution, for details about the copyright.
#

##[
This module contains routines and types for dealing with time using a proleptic Gregorian calendar.
It's is available for the `JavaScript target <backends.html#the-javascript-target>`_.
Contributor

Typo: Should be "it's also available"

This module contains routines and types for dealing with time using a proleptic Gregorian calendar.
It's is available for the `JavaScript target <backends.html#the-javascript-target>`_.

The types uses nanosecond time resolution, but the underlying resolution used by ``getTime()``
Contributor

"Although the types use nanosecond"
", the underlying"

echo "An hour from now : ", now() + 1.hours
echo "An hour from (UTC) now: ", getTime().utc + initDuration(hours = 1)

Parsing and formatting dates
Contributor

Needs title-case

============= ================================================================================= ================================================
Pattern Description Example
============= ================================================================================= ================================================
``d`` Numeric value of the day of the month, it will be one or two digits long. | ``1/04/2012 -> 1``
Contributor

"will be either one or"

Contributor

"Numeric value representing the day"

============= ================================================================================= ================================================
``d`` Numeric value of the day of the month, it will be one or two digits long. | ``1/04/2012 -> 1``
| ``21/04/2012 -> 21``
``dd`` Same as above, but always two digits. | ``1/04/2012 -> 01``
Contributor

"but is always"

@@ -36,8 +36,8 @@ let utcPlus2 = Timezone(zoneInfoFromUtc: staticZoneInfoFromUtc, zoneInfoFromTz:
block timezoneTests:
let dt = initDateTime(01, mJan, 2017, 12, 00, 00, utcPlus2)
doAssert $dt == "2017-01-01T12:00:00+02:00"
doAssert $dt.utc == "2017-01-01T10:00:00+00:00"
Contributor

Why are there so few tests here?

Member Author

Do you mean why there are so few tests in tests/js/ttimes.nim? The times tests should probably just be merged into a single file that runs for both C & JS. I can fix it in a separate PR.


proc toDateTime(p: ParsedTime, zone: Timezone, f: TimeLayout,
input: string): DateTime =
var month = mJan
Contributor

Wouldn't it be better to merge the declarations and assignments here?

Member Author

I did that originally, but then the compiler can't prove that month is initialized for some reason

else:
result = false
of y, yyy, yyyyy:
raise newException(ValueError, "The pattern '" & $pattern & "' " &
Contributor

strformat's fmt/& can be used here.

else:
result = false
of g:
if input[i..i+1].cmpIgnoreCase("BC") == 0:
Contributor

(Suggestion) These could possibly be optimized through use of the new toOpenArray proc.

if result:
i.inc 3
of dddd:
if input.substr(i, i+5).cmpIgnoreCase("sunday") == 0:
Contributor

These could possibly be optimized through use of toOpenArray.

Member Author

The string comparisons could definitely be optimized further, but I'll leave it for another time.

@GULPF
Member Author

GULPF commented Jun 24, 2018

Thanks for the review, it should now be addressed :) Some code could probably be extracted from parsePattern into helpers, but I don't know if it would improve readability much since parsePattern will still have a huge case-statement.

@Araq
Member

Araq commented Jun 25, 2018

Fails with

./koch nimsuggest
bin/nim c --noNimblePath -d:release -p:compiler nimsuggest/nimsuggest.nim
Hint: used config file '/home/travis/build/nim-lang/Nim/config/nim.cfg' [Conf]
Hint: used config file '/home/travis/build/nim-lang/Nim/nimsuggest/nimsuggest.nim.cfg' [Conf]
Hint: system [Processing]
Hint: nimsuggest [Processing]
Hint: strutils [Processing]
Hint: parseutils [Processing]
Hint: math [Processing]
Hint: bitops [Processing]
Hint: algorithm [Processing]
Hint: unicode [Processing]
Hint: os [Processing]
Hint: times [Processing]
Hint: options [Processing]
Hint: typetraits [Processing]
Hint: strformat [Processing]
Hint: macros [Processing]
Hint: posix [Processing]
lib/pure/times.nim(2201, 53) template/generic instantiation from here
lib/pure/times.nim(1657, 11) Error: can raise an unlisted exception: ref ValueError
FAILURE

result.add $dt.second
of ss:
result.add dt.second.intToStr(2)
of fff, ffffff, fffffffff:
Contributor

That's not a good solution. What happens if you don't have as many nanosecond digits as requested in the format string?

Member Author

No idea what I thought when I implemented it like that... Thanks for catching it
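One way to avoid the problem raised here is to zero-pad the nanosecond value to nine characters before cutting it down to the requested width. This is only a sketch of that idea: `fractionalDigits` is a hypothetical helper, not the fix that was actually applied in the PR.

```nim
import strutils

proc fractionalDigits(nanosecond: int, digits: int): string =
  ## Hypothetical helper: returns the first `digits` digits (1..9) of the
  ## fractional second, keeping leading zeros so the width is always `digits`.
  doAssert nanosecond in 0 .. 999_999_999
  doAssert digits in 1 .. 9
  let padded = intToStr(nanosecond, 9)   # always nine characters
  result = padded[0 ..< digits]

echo fractionalDigits(5_000_000, 3)      # 5 ms   -> 005
echo fractionalDigits(123_456_789, 6)    #        -> 123456
```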

Contributor

@dom96 dom96 left a comment

Looks good but some things that I would like to see changed, mainly bikeshedding :)


TimeLayout* = object ## Represents a format for parsing and printing
## time types.
patterns: seq[byte] ## \
Contributor

Why is this encoded as bytes? Wouldn't seq[LayoutPattern] make more sense?

Contributor

Also, I think LayoutPattern should be called TimePattern. It doesn't have much to do with the layout of the pattern so I'm not sure why you named it this way.

Contributor

I now noticed that you are referring to these format specifiers as "layout patterns" which is just confusing to me. Layout to me means "add 5 spaces before this string" or "indent and wrap these two lines so that they fit 80 characters", it's not about formatting time.

Contributor

Please rename all of these types and use the word "Format" instead of "Layout"

Member Author

The reason I wanted to avoid using "format" is because of the ambiguity (since it can be both a verb and a noun in this context). But English isn't my first language and you're probably right that using "format" anyway is better :)

Member Author

Why is this encoded as bytes? Wouldn't seq[LayoutPattern] make more sense?

See the doc comment for this field. Basically TimeLayout.patterns not only contains LayoutPattern values, but also arbitrary bytes that are treated as text. This is a bit hackish, but it seems to perform well.

Contributor

See the doc comment for this field. Basically TimeLayout.patterns not only contains LayoutPattern values, but also arbitrary bytes that are treated as text. This is a bit hackish, but it seems to perform well.

Isn't this ambiguous? dddd.byte == 3.byte? What if I want \3 in my string?

Member Author

Isn't this ambiguous? dddd.byte == 3.byte? What if I want \3 in my string?

Each literal sequence is prefixed by LayoutPattern.Lit and the length of the literal sequence
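The length-prefix scheme described here can be sketched as follows. This is an assumption-laden toy, not the PR's code: `Pat`, `addLiteral` and `tokens` are hypothetical names, and the enum has only three values. It shows why a literal byte that happens to equal a pattern's ordinal is not ambiguous: the decoder skips over literal bytes by length instead of inspecting them.

```nim
type
  Pat = enum        # hypothetical stand-in for the real pattern enum
    d, dd, Lit      # ordinals 0, 1, 2

proc addLiteral(buf: var seq[byte], text: string) =
  ## Appends `text` as: Lit tag, one length byte, then the raw bytes.
  doAssert text.len <= 255
  buf.add byte(Lit)
  buf.add byte(text.len)
  for c in text:
    buf.add byte(c)

iterator tokens(buf: seq[byte]): (Pat, string) =
  ## Walks the packed buffer; literal payloads are skipped by length.
  var i = 0
  while i < buf.len:
    let kind = Pat(buf[i])
    inc i
    if kind == Lit:
      let n = int(buf[i])
      inc i
      var text = newString(n)
      for j in 0 ..< n:
        text[j] = char(buf[i + j])
      i += n
      yield (kind, text)
    else:
      yield (kind, "")

var buf: seq[byte] = @[]
buf.add byte(dd)
buf.addLiteral("\x01T")   # contains a byte equal to ord(dd)
buf.add byte(d)

for kind, text in tokens(buf):
  echo kind, " (literal length ", text.len, ")"
```

The trade-off discussed in the thread still applies: the buffer is compact and needs a single allocation, but nothing in the type system prevents a malformed sequence.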

## be encoded as ``@[Lit.byte, 3.byte, 'f'.byte, 'o'.byte, 'o'.byte]``.
layout: string

const LayoutPatternSeperators = { ' ', '-', '/', ':', '(', ')', '[', ']', ',' }
Contributor

To me this means that these characters separate the time format pattern from a different layout pattern that can be used to lay out the time string (similar to how floats can be indented etc.)

Please change this naming scheme. These should be called PatternLiterals or something.


currentF = ""
template yieldcurrToken() =
Contributor

Nitpick: yieldCurrToken


yieldcurrToken()

proc stringToPattern(str: string): LayoutPattern =
Contributor

This can be simplified into parseEnum[LayoutPattern](str).

Member Author

parseEnum is case insensitive, which doesn't work for this enum

Contributor

parseEnum[LayoutPattern](str.toLower())? :)

Member Author

What I mean is that parseEnum doesn't care about case at all, see #7686. LayoutPattern has values that only differs in case.

var year: int
var monthday: int
(year, month, monthday) =
if p.year.isNone or p.month.isNone or p.monthday.isNone:
Contributor

This if is unnecessary, you can just use the first branch or is this just an optimisation to prevent calling now unnecessarily?

If so, please add a comment.

Member Author

or is this just an optimisation to prevent calling now unnecessarily?

Bingo, now is quite expensive. I'll add a comment.

result = format(dt, "yyyy-MM-dd'T'HH:mm:sszzz") # todo: optimize this
except ValueError: assert false # cannot happen because format string is valid
doAssert $dt == "2000-01-01T12:00:00Z"
result = format(dt, "yyyy-MM-dd'T'HH:mm:sszzz") # todo: optimize this
Contributor

Does format no longer raise? Also, maybe we can just remove that "TODO" now.

Member Author

The static[T] overloads mean that errors in the format string are caught at compile time, so if the format string is known at compile time, format won't raise any exception
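A small sketch of the difference between the two overloads, assuming a Nim version that includes this PR. `tryFormat` is a hypothetical wrapper whose `f` parameter is an ordinary runtime string, so the format is validated when the call runs and a bad pattern surfaces as a ValueError.

```nim
import times

let dt = initDateTime(1, mJan, 2000, 12, 0, 0, utc())

# String literal: the static[string] overload is chosen, so an invalid
# pattern here would be reported by the compiler instead of at run time.
echo dt.format("yyyy-MM-dd")

proc tryFormat(dt: DateTime, f: string): string =
  ## Hypothetical wrapper: `f` is only known at run time inside this proc,
  ## so an invalid format string raises ValueError when format is called.
  try:
    result = dt.format(f)
  except ValueError:
    result = "invalid format: " & f

echo tryFormat(dt, "yyyy-MM-dd")
echo tryFormat(dt, "yyyy xx")   # 'xx' is not a pattern and is not quoted
```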

@timotheecour
Member

timotheecour commented Jul 7, 2018

A bit late to the party but just wanted to mention this to make sure this was considered:

  • This design is not as flexible as could be, eg, inserting a runtime-defined character inside the format string is a bit awkward
    eg:
var str : string = getString()
# to use a runtime `str` instead of fixed string  `'T'` in format(x, "yyyy-MM-dd'T'HH:mm:sszzz") we'd need:
format(x, "yyyy-MM-dd'" & std.escapeSingleQuote & "'HH:mm:sszzz")
  • Also, it feels more magical (harder to distinguish special date variables in the string) and is inconsistent with strformat strings: fmt"hello my name is {str}"

  • Common libraries also use a technique analogous to strformat, where the special variables (eg MM) are denoted as special instead of the other way around, eg in Python d.strftime("%d/%m/%y")

My suggestion was instead doing this:

var some_variable : string = getString()
format(x, " some_inline_string {yyyy}-{MM}-{dd}{some_variable}{HH}:{mm}:{sszzz}")

which could be implemented in terms of strformat.fmt

@GULPF
Member Author

GULPF commented Jul 7, 2018

This design is not as flexible as could be, eg, inserting a runtime-defined character inside the format string is a bit awkward

It's definitely awkward, but what's the use case? times.format should not be used for general string formatting, that's what strformat is for. I can't imagine a date time format that requires interpolation with a runtime string.

IMO times.format should be used from strformat.fmt, not the other way around. This is already possible:

import strformat, times
let dt = now()
echo fmt"Date: {dt:MMMM yyyy}"

case f[i]
of '\'':
yieldcurrToken()
if f[i.succ] == '\'':
Member

Missing index check.

inc(i)
else: result.add(f[i])

while f[i] != '\'' and i < f.high:
Member

Check in the wrong order.
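Both index remarks come down to evaluation order: `and` short-circuits, so the bounds test has to come before the character access. A minimal sketch with a hypothetical helper, `skipToQuote`, which is not part of the PR:

```nim
proc skipToQuote(f: string, start: int): int =
  ## Hypothetical helper: returns the index of the next '\'' at or after
  ## `start`, or f.len if there is none.
  result = start
  # The bounds check comes first, so f[result] is never read out of range.
  while result < f.len and f[result] != '\'':
    inc result

echo skipToQuote("ab'c", 0)   # 2
echo skipToQuote("abc", 0)    # 3, no quote found
```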

@dom96
Contributor

dom96 commented Jul 9, 2018

Isn't this ambiguous? dddd.byte == 3.byte? What if I want \3 in my string?

Each literal sequence is prefixed by LayoutPattern.Lit and the length of the literal sequence

Right, I would say an object variant would be better. Is there a reason you can't use that?

It's cool though, we can merge this and fix this later if necessary.

@GULPF
Member Author

GULPF commented Jul 9, 2018

Right, I would say an object variant would be better. Is there a reason you can't use that?

The nice thing about the current design is that it will never use more than a single ref. An object variant would require additional refs. It would affect performance, but maybe not by much.

@Araq
Member

Araq commented Jul 9, 2018

The nice thing about the current design is that it will never use more than a single ref. An object variant would require additional refs. It would affect performance, but maybe not by much.

These packed representations based on seq[byte] are the future, please keep it this way.

@Araq
Member

Araq commented Jul 9, 2018

Unrelated CI failures. Merging.

@Araq Araq merged commit 3b310e9 into nim-lang:devel Jul 9, 2018
@dom96
Contributor

dom96 commented Jul 10, 2018

These packed representations based on seq[byte] are the future, please keep it this way.

A data structure/type/DSL that maps to a seq[byte] perhaps, what I really dislike is the lack of type safety in the current approach.


Successfully merging this pull request may close these issues.

Times module lacks a way to generate 'Z'
[RFC] Parsing and formatting dates
7 participants