Support of temporal column types #21

Open
sgoeschl opened this issue May 12, 2020 · 18 comments

@sgoeschl

Hi Alexander,

I'm looking at "DataFrame" to use it in one of my pet projects.

  • How can temporal column types be supported, e.g. LocalDate or LocalDateTime? What I want to do is hide a parsed CSV or Excel sheet behind "DataFrame", but Excel has temporal column types.
  • Did you have a look at Commons CSV? Since you strive for minimal dependencies it might not be an option, but I wanted to ask :-)

Siegfried

@nRo
Owner

nRo commented May 18, 2020

Hi Siegfried,

Thanks for your suggestions.
I plan to rework column types to allow easy definition of custom value types.
This would allow the definition of DateColumns.
Further, it would require the option to add hints for parsing (e.g. date format within a column).
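As a rough illustration of what a custom value type with a parsing hint could look like, here is a minimal sketch. The names `ValueParser` and `LocalDateParser` are hypothetical and are not the library's actual API; the only real calls are from `java.time`. The date pattern plays the role of the per-column hint.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical interface; the real value-type abstraction may look different.
interface ValueParser<T> {
    T parse(String raw);
    String format(T value);
}

// Hypothetical date value type that carries its format as a parsing hint.
final class LocalDateParser implements ValueParser<LocalDate> {
    private final DateTimeFormatter formatter;

    LocalDateParser(String pattern) {
        this.formatter = DateTimeFormatter.ofPattern(pattern);
    }

    @Override
    public LocalDate parse(String raw) {
        // Treat blank cells as missing values instead of failing.
        if (raw == null || raw.trim().isEmpty()) {
            return null;
        }
        return LocalDate.parse(raw.trim(), formatter);
    }

    @Override
    public String format(LocalDate value) {
        return value == null ? "" : formatter.format(value);
    }
}

final class DateHintExample {
    public static void main(String[] args) {
        ValueParser<LocalDate> parser = new LocalDateParser("dd.MM.yyyy");
        LocalDate date = parser.parse("14.10.2019");
        System.out.println(date);                // ISO form of the parsed date
        System.out.println(parser.format(date)); // written back in the column's own format
    }
}
```

Keeping the formatter inside the value type means the same hint can be used for both reading and writing a column.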

A bridge for Commons CSV is a great idea to improve file format support.

@sgoeschl
Author

Hi Alexander,

Drop me a note when there is something to integrate - it seems that I have some unexpected free time in summer due to COVID-19 :-)

Siegfried

@nRo
Owner

nRo commented May 19, 2020

Hi Siegfried,

Help is always greatly appreciated :)

I hope that I can work on custom value type support this weekend.
Further, I will try to create issues for other features and enhancements to better track progress.

@sgoeschl
Author

I'm ready to kick into action - https://issues.apache.org/jira/browse/FREEMARKER-144

@nRo
Owner

nRo commented May 31, 2020

So I started working on custom value type support in this branch: value-type-abstraction, and created issue #22.
There are still some TODOs to support custom value types, described in #22.

This approach would allow the creation of a Date ValueType that can support different formats for parsing and writing.

@sgoeschl
Author

sgoeschl commented Jun 1, 2020

I got a proof of concept running that integrates "nRo/Dataframe" into Apache FreeMarker CLI.

@nRo
Owner

nRo commented Jun 2, 2020

That's great!

Regarding temporal column types:
How would you expect date formats to be handled by default?
Some kind of auto detection, or a default format that can be changed by giving hints when the data is read?
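To make the two options concrete, here is a minimal sketch of auto detection that falls back on a configurable list of candidate patterns. The class `DateFormatDetector` and its default pattern list are assumptions made for illustration, not part of the library. The idea: a pattern is accepted only if it parses every non-blank value in the column.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class DateFormatDetector {
    // Candidate patterns tried in order; this list is an assumption for the sketch.
    static final List<String> DEFAULT_PATTERNS =
            Arrays.asList("yyyy-MM-dd", "dd.MM.yyyy", "MM/dd/yyyy");

    // Returns the first pattern that parses every non-blank value, if any.
    static Optional<String> detect(List<String> values, List<String> patterns) {
        for (String pattern : patterns) {
            DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern);
            if (matchesAll(values, formatter)) {
                return Optional.of(pattern);
            }
        }
        return Optional.empty();
    }

    private static boolean matchesAll(List<String> values, DateTimeFormatter formatter) {
        boolean sawValue = false;
        for (String value : values) {
            if (value == null || value.trim().isEmpty()) {
                continue; // missing cells count neither for nor against a pattern
            }
            sawValue = true;
            try {
                LocalDate.parse(value.trim(), formatter);
            } catch (DateTimeParseException e) {
                return false;
            }
        }
        return sawValue; // a column with no values at all is not detected as a date column
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("14.10.2019", "", "01.02.2020");
        System.out.println(detect(column, DEFAULT_PATTERNS));
    }
}
```

Detection of this kind cannot resolve genuinely ambiguous values (for example, a day-first versus month-first reading of the same digits), which is one reason an explicit hint would still be useful alongside it.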

@sgoeschl
Author

sgoeschl commented Jun 3, 2020

You make the decision - both approaches sound good to me :-)

What I want to do:

  • Transform Apache Commons CSV records into a DataFrame; here I only handle Strings (partly implemented)
  • Transform a Map into a DataFrame (partly implemented)
  • Transform an Excel sheet (POI & getDateCellValue) into a DataFrame

So I guess I come from the side of defining the DataFrame first and then populating it.
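A small sketch of the "define, then populate" direction, using only the JDK. The class `ColumnPopulationSketch` and its methods are hypothetical helpers, and no DataFrame or POI calls are shown. It assumes the spreadsheet cell has already been read as a `java.util.Date` (which is what POI's `getDateCellValue` returns) and that rows arrive as maps.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnPopulationSketch {
    // Converts a legacy java.util.Date into a LocalDateTime in the given zone.
    static LocalDateTime toLocalDateTime(Date date, ZoneId zone) {
        return date == null ? null : date.toInstant().atZone(zone).toLocalDateTime();
    }

    // Turns row-oriented maps into column-oriented lists, keyed by column name.
    static Map<String, List<Object>> toColumns(List<String> names, List<Map<String, Object>> rows) {
        Map<String, List<Object>> columns = new LinkedHashMap<>();
        for (String name : names) {
            columns.put(name, new ArrayList<>());
        }
        for (Map<String, Object> row : rows) {
            for (String name : names) {
                columns.get(name).add(row.get(name)); // null marks a missing cell
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("name", "example");
        // In a real integration this Date would come from the spreadsheet cell.
        row.put("created", toLocalDateTime(new Date(0L), ZoneId.of("UTC")));
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row);
        System.out.println(toColumns(Arrays.asList("name", "created"), rows));
    }
}
```

The zone has to be chosen explicitly because a `java.util.Date` is an instant, while a `LocalDateTime` has no zone of its own.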

@nRo
Owner

nRo commented Jun 7, 2020

Okay, thanks for the feedback.
Unfortunately, I don't have much time at the moment.
But I will keep working on that over the next days.

@sgoeschl
Author

Mostly finished but not really happy with the code :-)

@sgoeschl
Author

Is the feature branch stable enough to do some preliminary integration?

@nRo
Owner

nRo commented Jun 18, 2020

The feature branch is pretty stable.
It only needs support for autodetection of custom types and some tests.

@nRo
Owner

nRo commented Jul 26, 2020

Sorry for the delay. The custom value feature branch is now merged into master (#22).

I will now start looking into temporal column types.
I suppose that three different types would be required:

  • Date
  • DateTime
  • Timespan

Each should be able to handle different formats.
Possible operations for these column types would be:

  • convert them from one type into another
  • operations for different time units (add minutes, hours, days,...)
  • operations between columns (calculate difference,...)

Any other suggestions?
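For reference, the first two kinds of operations map fairly directly onto `java.time`. The following is only a sketch over plain lists: `TemporalColumnOps` is a hypothetical helper, not a column class from the library, and it ignores missing (null) values for brevity.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class TemporalColumnOps {
    // Date -> DateTime: every date becomes midnight at the start of that day.
    static List<LocalDateTime> toDateTime(List<LocalDate> dates) {
        return dates.stream().map(LocalDate::atStartOfDay).collect(Collectors.toList());
    }

    // DateTime -> Date: drops the time-of-day part.
    static List<LocalDate> toDate(List<LocalDateTime> dateTimes) {
        return dateTimes.stream().map(LocalDateTime::toLocalDate).collect(Collectors.toList());
    }

    // Adds an amount of the given unit (minutes, hours, days, ...) to each value.
    static List<LocalDateTime> plus(List<LocalDateTime> dateTimes, long amount, ChronoUnit unit) {
        return dateTimes.stream().map(v -> v.plus(amount, unit)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LocalDate> dates = Arrays.asList(LocalDate.of(2019, 10, 14), LocalDate.of(2020, 2, 1));
        List<LocalDateTime> shifted = plus(toDateTime(dates), 90, ChronoUnit.MINUTES);
        System.out.println(shifted);
        System.out.println(toDate(shifted));
    }
}
```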

@sgoeschl
Author

sgoeschl commented Aug 5, 2020

Hi, I'm a little bit confused by "Timespan" - is this a time-only value, e.g. "13:42:23", or do you mean a duration such as "3:24h"?

Regarding operations - few to no suggestions from my side - I would mostly filter / sort / query on temporal columns coming from CSV and Excel.

@nRo
Copy link
Owner

nRo commented Aug 11, 2020

Hi,
With "Timespan" I mean a type like the Java 8 Duration.
Subtracting two Dates would result in a Timespan, for example.
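In plain `java.time` terms, that subtraction could be sketched as follows. `TimespanSketch` is a hypothetical helper and not a type from the library. One detail worth noting: `Duration.between` does not accept two `LocalDate` values directly, because `LocalDate` has no time-of-day fields, so the sketch anchors dates at midnight first.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;

final class TimespanSketch {
    // Difference between two DateTime values as a Duration (end minus start).
    static Duration between(LocalDateTime start, LocalDateTime end) {
        return Duration.between(start, end);
    }

    // Dates are anchored at the start of the day before subtracting.
    static Duration between(LocalDate start, LocalDate end) {
        return Duration.between(start.atStartOfDay(), end.atStartOfDay());
    }

    public static void main(String[] args) {
        Duration span = between(
                LocalDateTime.of(2019, 10, 14, 12, 0),
                LocalDateTime.of(2019, 10, 14, 15, 24));
        // Hours and the remaining minutes of the timespan.
        System.out.println(span.toHours() + "h " + (span.toMinutes() % 60) + "min");
        System.out.println(between(LocalDate.of(2019, 10, 14), LocalDate.of(2019, 10, 16)).toDays());
    }
}
```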

@sgoeschl
Author

Ack - assuming that I understood things correctly:

  • A column to be read would consist of either Date (e.g. 14.10.2019) or DateTime in some format (e.g. 2019-10-14T12:00:00)
  • Subtraction of dates would result in a timespan

A few questions along those lines:

  • Is it possible to read timespans from CSV? Usually they don't have a qualifier such as seconds or day as cell values
  • Would timestamps be supported, e.g. "12:01:31"?

@nRo
Owner

nRo commented Aug 22, 2020

Sorry for the delay again.

  • A column to be read would consist of either Date (e.g. 14.10.2019) or DateTime in some format (e.g. 2019-10-14T12:00:00)

Exactly. Some points I am still not sure about:

  • How to pass format information to the CSV parser (should be rather simple)
  • How to implement autodetection for temporal columns
  • How to handle time zones

  • Subtraction of dates would result in a timespan

Yes, that's how I would do it. If two dates are subtracted, you get information about how many hours, minutes, seconds, ... passed between those dates.

  • Is it possible to read timespans from CSV? Usually they don't have a qualifier such as seconds or day as cell values
  • Would timestamps be supported, e.g. "12:01:31"?

Timestamps could be supported by adding parsing hints to the CSV parser.
They could either just be epoch milliseconds (Long) or a format like the one you described.

Autodetection for timestamps is more difficult, because a column of epoch timestamps looks exactly like a plain Long column.
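A short sketch of why a hint is needed here, using only `java.time`. `TimestampHints` and its method names are hypothetical. The same raw cell text can be read as a time of day, as epoch milliseconds, or as an ordinary number, and only the caller knows which one is meant.

```java
import java.time.Instant;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

final class TimestampHints {
    // Hint "time of day": parses values such as "12:01:31".
    static LocalTime parseTimeOfDay(String raw) {
        return LocalTime.parse(raw.trim(), DateTimeFormatter.ISO_LOCAL_TIME);
    }

    // Hint "epoch milliseconds": parses a Long and interprets it as an instant.
    static Instant parseEpochMillis(String raw) {
        return Instant.ofEpochMilli(Long.parseLong(raw.trim()));
    }

    // What autodetection sees without a hint: just something that parses as a Long.
    static boolean looksLikeLong(String raw) {
        try {
            Long.parseLong(raw.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimeOfDay("12:01:31"));
        System.out.println(parseEpochMillis("0"));
        // True for epoch values and for ordinary counters alike, hence the ambiguity.
        System.out.println(looksLikeLong("1598054400000"));
    }
}
```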

@sgoeschl
Author

sgoeschl commented Aug 23, 2020 via email
