Implement datetime columns #1646

Closed
st-pasha opened this issue Feb 8, 2019 · 3 comments
Assignees
Labels
cust-goldmansachs EPIC ⭐ Big task that may encompass many smaller ones new feature Feature requests for new functionality

Comments

@st-pasha
Contributor

st-pasha commented Feb 8, 2019

This is a request to add new stypes and functions in support of date/time functionality.

Currently, different packages use different formats for dates and times:

Apache Arrow

  • date[day] (32bit, days since the Unix epoch)
  • date[ms] (64bit, milliseconds since the Unix epoch)
  • time[s] (32bit)
  • time[ms] (32bit)
  • time[µs] (64bit)
  • time[ns] (64bit)
  • timestamp[s] (64bit)
  • timestamp[ms] (64bit)
  • timestamp[µs] (64bit)
  • timestamp[ns] (64bit)
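
The bit widths of the time[*] types in the list above follow from how many ticks fit in one day. A minimal sketch of that arithmetic, using only plain Python (the helper name `min_width` is made up for illustration and is not part of Arrow's API):

```python
# Largest time-of-day value (one tick before midnight) for each unit,
# compared against the range of a signed 32-bit integer.
INT32_MAX = 2**31 - 1
SECONDS_PER_DAY = 24 * 60 * 60

# "us" stands for microseconds (µs in the list above).
ticks_per_second = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def min_width(unit):
    """Return 32 or 64: the narrowest signed width that holds a full day."""
    max_value = SECONDS_PER_DAY * ticks_per_second[unit] - 1
    return 32 if max_value <= INT32_MAX else 64


widths = {unit: min_width(unit) for unit in ticks_per_second}
for unit, bits in widths.items():
    print(f"time[{unit}]: {bits}bit")
```

Seconds and milliseconds within a day fit in 32 bits, while microseconds and nanoseconds need 64, which matches the widths listed.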

When reading from CSV, pyarrow parses date/time columns as timestamp[s]. Timestamps with milliseconds are read as strings. Some common date formats, such as MM/DD/YYYY, are not recognized.
The arrow<->pandas conversion guide says that all timestamp[*] formats are converted to pd.Timestamp (np.datetime64[ns]), and all date[*] formats become an object column with datetime.date items.

Pandas

In pandas, there are standalone classes such as pd.Timestamp, pd.Period, pd.DateOffset and pd.Timedelta, but also column-like DatetimeIndex (with dtype datetime64[ns]), and also a Series with dtype datetime64[ns].

When reading from CSV, datetime columns are not parsed by default and remain as strings (objects). The conversion can be performed afterwards using the to_datetime() function.

Numpy

In numpy all dates are 64-bit, but with various time units: Y, M, W, D, h, m, s, ms, us, ns, ps, fs, as. All datetimes are based on POSIX time with epoch of 1970-01-01T00:00Z.
In numpy 1.6 the default format when parsing was datetime64[us]; since 1.7 the format is selected based on the string.
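
Since every unit shares the same 64-bit storage, choosing a finer unit shrinks the representable range around the epoch. A rough sketch of that trade-off in plain Python (no numpy required; the mean Gregorian year length is an approximation, and `span_years` is a made-up helper name):

```python
# Half-range of a signed 64-bit tick counter, expressed in years, per unit.
INT64_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.2425 * 86400  # mean Gregorian year (approximation)

ticks_per_second = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}


def span_years(unit):
    """Approximate number of years representable on each side of the epoch."""
    return INT64_MAX / ticks_per_second[unit] / SECONDS_PER_YEAR


spans = {unit: span_years(unit) for unit in ticks_per_second}
for unit, years in spans.items():
    print(f"[{unit}]: about +/- {years:,.0f} years around 1970")
```

With nanosecond ticks the range is only about 292 years on either side of 1970, which is worth keeping in mind when picking a single resolution for a new datetime stype.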

Python

The datetime module provides the classes datetime.date, datetime.time, datetime.datetime and datetime.timedelta. All of these are relatively "heavy" objects, each storing multiple fields: .date has year, month, day; .time has hour, min, sec, microsec and timezone; etc.

There is also a C equivalent of the datetime module; there the .date object is 4+N bytes, .time is 6+N bytes, and .datetime is 10+N bytes, where N=17 is a per-object overhead.
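
To get a feel for the per-object cost, here is a small sketch that asks the interpreter for the size of each object. The numbers reported by `sys.getsizeof` depend on the interpreter and platform, so they need not match the figures quoted above; the point is only the comparison with a 4-byte integer.

```python
import datetime
import sys

# One sample object of each class; the particular values are arbitrary.
samples = {
    "date": datetime.date(2019, 2, 8),
    "time": datetime.time(12, 30, 15),
    "datetime": datetime.datetime(2019, 2, 8, 12, 30, 15),
}

# Size in bytes of each Python object, as reported by the interpreter.
sizes = {name: sys.getsizeof(obj) for name, obj in samples.items()}

INT32_BYTES = 4  # storage for one value in a 32-bit "days since epoch" column
for name, size in sizes.items():
    print(f"datetime.{name}: {size} bytes per object vs {INT32_BYTES} bytes as int32")
```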

See Also

@st-pasha st-pasha added the improve Improvement of an existing functionality label Feb 8, 2019
@st-pasha st-pasha added this to the Release 0.9.0 milestone Feb 8, 2019
@st-pasha st-pasha self-assigned this Feb 8, 2019
@st-pasha st-pasha added new feature Feature requests for new functionality and removed improve Improvement of an existing functionality labels Feb 19, 2019
@XiaomoWu

XiaomoWu commented May 2, 2019

IMHO, every option is better than the pandas way. In Rdatatable, a date or time is either stored as an R object like POSIXct or Date (equivalent to the "Python" solution above), or stored internally as an integer like IDate or ITime (equivalent to implementing pydatatable's own datetime types).

@st-pasha
Contributor Author

Prerequisite: #1396

@st-pasha st-pasha removed this from the Release 0.10.0 milestone Nov 7, 2019
@st-pasha st-pasha mentioned this issue Jan 4, 2020
27 tasks
@st-pasha st-pasha added the EPIC ⭐ Big task that may encompass many smaller ones label Aug 20, 2020
@st-pasha st-pasha added this to To Do in Sprint Aug 18 - Sep 7 via automation Aug 21, 2020
@st-pasha st-pasha added this to the Release 0.11.0 milestone Aug 21, 2020
@st-pasha st-pasha removed this from To Do in Sprint Aug 18 - Sep 7 Aug 21, 2020
@st-pasha st-pasha removed this from the Release 0.11.0 milestone Aug 24, 2020
@st-pasha st-pasha removed their assignment Sep 24, 2020
st-pasha added a commit that referenced this issue Feb 18, 2021
Created column type `date32` for storing calendar dates. Currently only the following operations are supported:
- creation from python `datetime.date` objects;
- converting into python;
- `repr()`, i.e. the column can be viewed in a console.

WIP for #1646
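
The three operations listed in the commit message can be pictured with a toy column that stores each date as a 32-bit count of days since the Unix epoch. This is a hypothetical illustration only: the class `Date32Column` and its methods are invented here and say nothing about how datatable actually implements `date32`.

```python
from array import array
import datetime

EPOCH = datetime.date(1970, 1, 1)


class Date32Column:
    """Hypothetical toy column; NOT datatable's actual implementation."""

    def __init__(self, dates):
        # Creation from python datetime.date objects.
        # Typecode "i" is a signed C int (4 bytes on common platforms).
        self._days = array("i", ((d - EPOCH).days for d in dates))

    def to_list(self):
        # Converting back into python datetime.date objects.
        return [EPOCH + datetime.timedelta(days=n) for n in self._days]

    def __repr__(self):
        # Human-readable view for the console.
        return "date32[" + ", ".join(d.isoformat() for d in self.to_list()) + "]"


col = Date32Column([datetime.date(2019, 2, 8), datetime.date(1969, 12, 31)])
print(col)
```

Dates before the epoch simply become negative day counts, so no special handling is needed for them in this sketch.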
@st-pasha st-pasha mentioned this issue Feb 19, 2021
21 tasks
@st-pasha
Contributor Author

Done.

@st-pasha st-pasha self-assigned this Jun 23, 2021