"Unable to parse time zone" when reading a specific Parquet file (Parquet doesn't store time zones) #9586
thanks for the report - looks like

```python
In [3]: parquet_df = polars.read_parquet(r"sample_weird_tz.parquet", use_pyarrow=True)
   ...: print(parquet_df)
shape: (1, 1)
┌─────────────────────────┐
│ UTC_DATETIME_ID         │
│ ---                     │
│ datetime[ns, UTC]       │
╞═════════════════════════╡
│ 2023-06-26 14:15:00 UTC │
└─────────────────────────┘
```

will take a look

Right, looks like the file's schema contains an offset in the place of a time zone:

```python
In [16]: pl.read_parquet_schema("sample_weird_tz.parquet")
Out[16]: {'UTC_DATETIME_ID': Datetime(time_unit='ns', time_zone='+00:00')}
```

This should probably error earlier, rather than parsing with an invalid schema

Something you could do for now is to override the schema:

```python
In [20]: pl.scan_parquet("sample_weird_tz.parquet").with_columns(pl.col('UTC_DATETIME_ID').cast(pl.Datetime('ns', 'UTC'))).collect()
Out[20]:
shape: (1, 1)
┌─────────────────────────┐
│ UTC_DATETIME_ID         │
│ ---                     │
│ datetime[ns, UTC]       │
╞═════════════════════════╡
│ 2023-06-26 14:15:00 UTC │
└─────────────────────────┘
```

That doesn't sound right... Parquet can't store timezone info, according to their own docs, as far as I can tell. Are you sure it's not added by arrow-rs at some point when reading?

ooh, well-spotted, thanks!

I mean, it would be arrow2, but still, this line:

`polars/polars/polars-io/src/parquet/read.rs`, line 193 in 3282c74

is already making a `Field` with that datatype

@MarcoGorelli: Arrow says that offsets are just as valid timezone specifiers as IANA timezone names, and that

Thanks for the link. It's intentional to only support IANA time zone names, see: #9103 (comment)

If it's only for the sake of

If other issues come up, then maybe

I can understand why fixed offsets might cause issues, especially in something as complicated as date handling. There's always the possibility of handling these like Parquet does, by just applying the offset so they're all UTC and losing the offset information. I can't imagine such offsets would be that common an occurrence anyway, and having a special case to alias +00:00 to UTC for the performance would probably work well enough as an alternative to just rejecting them outright.

yup! this is what's done in

the issue is that here '+00:00' is part of the time zone... doing this works:

```diff
--- a/polars/polars-core/src/series/from.rs
+++ b/polars/polars-core/src/series/from.rs
@@ -198,8 +198,9 @@ impl Series {
                 let mut tz = tz.clone();
                 if tz.as_deref() == Some("") {
                     tz = None;
+                } else if tz.as_deref() == Some("+00:00") {
+                    tz = Some("UTC".to_string());
                 }
```

Will look into this, but for the second file, it's read just fine:

```python
In [31]: pl.read_parquet_schema("sample_weird_tz.parquet")
Out[31]: {'UTC_DATETIME_ID': Datetime(time_unit='ns', time_zone='+00:00')}

In [32]: pl.read_parquet_schema("sample_df_polars_identical.parquet")
Out[32]: {'UTC_DATETIME_ID': Datetime(time_unit='us', time_zone='UTC')}
```

I think it does store timezone info:

```python
In [38]: pd.DataFrame({'ts': pd.date_range('2000', periods=3, tz='+01:00')}).to_parquet('tmp.parquet')

In [39]: pl.read_parquet_schema("tmp.parquet")
Out[39]: {'ts': Datetime(time_unit='ns', time_zone='+01:00')}
```

So I think the fix here is to raise a loud and informative error if the time zone is one that chrono-tz can't parse, informing users of how to work around it

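Roughly, the check I have in mind, sketched in Python with the stdlib zoneinfo database standing in for chrono-tz (illustrative only - the real fix would live in the Rust readers):

```python
from zoneinfo import available_timezones

def validate_time_zone(tz: str) -> None:
    # reject anything that isn't an IANA name, e.g. fixed offsets like '+00:00'
    if tz not in available_timezones():
        raise ValueError(
            f"unable to parse time zone {tz!r}: only IANA time zone names are "
            "supported. As a workaround, cast the column to e.g. "
            "pl.Datetime('ns', 'UTC') after reading."
        )

validate_time_zone("+00:00")  # raises ValueError with the message above
```
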
A "loud and informative" error is what I just got in my code that has been working for months now where I explicitly try to convert the timezone to |
thanks for the report, I'll see what can be done |
@kavorite just for my understanding, how are you ending up with |
@MarcoGorelli encoded using parquet-go |
thanks! does that one also produce dtype with timezones like |
only UTC afaik, I've actually tried using it with timezone-aware time.Time instances in my encoder, but it seems to encode them naively regardless |
thanks - OK may it's fine to just convert |
@sm-Fifteen how about you, how did you end up with |
Are you sure this is always true? For example, looks like pyarrow doesn't allow '+00:00' for arrays:

```python
In [55]: pc.assume_timezone(pa.array([datetime(2020, 1, 1)]), '+00:00')
---------------------------------------------------------------------------
ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
```

@kavorite would it be possible to save using

It wouldn't technically be difficult to map '+00:00' to 'UTC' within polars, I'm just trying to be very careful before doing such special-casing

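In the meantime, one possible workaround is to rewrite any '+00:00' timestamp fields to 'UTC' on the pyarrow side before handing the table to polars (a sketch, assuming pyarrow is installed - the stored int64 instants are unchanged, only the schema metadata differs):

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("sample_weird_tz.parquet")
# swap the fixed-offset '+00:00' time zone for 'UTC' in the schema
fields = [
    f.with_type(pa.timestamp(f.type.unit, tz="UTC"))
    if pa.types.is_timestamp(f.type) and f.type.tz == "+00:00"
    else f
    for f in table.schema
]
df = pl.from_arrow(table.cast(pa.schema(fields)))
```
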
It's the output of some big ETL data processing tool that uses "parquet-cpp-arrow version 7.0.0" under the hood, judging by the file footer. It's at least partly a bug on their part, because my source data actually contained naïve datetimes, and the tool in question assumed (correctly, as it happens) that these were UTC timestamps. I don't know how they configured the writer on their end, but the Parquet decoder Polars uses seems to somehow recognize a fixed offset of "+00:00" in the metadata. I'm not entirely sure if it's because there's some metadata field in the file that actually indicates this specific offset or if it's injected by the decoder.

I don't know if that's what pyarrow does internally, but I do know that if I use the pyarrow decoder then loading this file Just Works, so I suspect the condition you used may not check whether datetimes are valid, but instead whether datetime conversion is.

Also recall that there are ways to correct my pain points beyond just special-casing, even if the "special case" is an explicit carve-out made in other implementations. If the interface to override time zone information altogether actually did its job, none of this would be a problem for me. The problem isn't that polars is crashing, the problem is that the fix to make polars' "bad behavior" explicit in this case also won't make it work. I can't even cast to the physical type and back:

Despite any protests, polars always finishes having its crisis of identity over my data that is actually serialized correctly before ever encountering these explicit conversions, and hence the entire lazy loading API becomes inaccessible anyway. Isn't that also technically applicable to my poor experience? It doesn't seem that fixing such a thing requires any "special casing."

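The round-trip being described would look something like this (a sketch only - on affected versions the panic fires before these casts ever run):

```python
import polars as pl

df = pl.read_parquet("sample_weird_tz.parquet")
df = df.with_columns(
    pl.col("UTC_DATETIME_ID")
    .to_physical()                # Int64: nanoseconds since the UNIX epoch
    .cast(pl.Datetime("ns"))      # reinterpret as a naive datetime
    .dt.replace_time_zone("UTC")  # attach a time zone polars understands
)
```
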
thanks for your responses - this is what I meant by special-casing: https://github.com/pola-rs/polars/pull/9747/files

The issue with not validating the time zone right away and just letting it be written by the lazy interface is that, as reported in the first message here, you might end up with

, on which all datetime computations would fail (though maybe there is a way around this?)

I'm suggesting that just treating

last time I checked, I was able to successfully

there was some regression between 0.18.1 and 0.18.5 that made this code just crash

Yeah, it was due to #9598, which prevented invalid time zones from being present when loading

If

Could you confirm whether

If not, then #9598 could just be reverted, no big deal

`+00:00` is the only one that my application needs for the moment

The problem is that "+00:00" isn't actually invalid. It's perfectly valid according to the Arrow spec, it's just that Polars doesn't handle it. The mistake here was assuming that any timezone specifier that Polars can't handle is invalid.

Ok, it prevented time zones not present in the time zone database from being present when loading

Point is - is it only

I'd suggest starting with #9747, and then, if other issues arise, just revert #9598

FYI, it's referred to as an "Offset date-time" in the stream format spec. It's very much part of the Arrow format (regardless of whether or not it's part of the Parquet format).

EDIT: Actually, given the way datetimes are stored in Arrow, you don't even need to understand the timezone string (which can be arbitrary and has room to be extended) in order to work with the dates. To try and be flexible with the data you accept, you could just have an operation (

Timezone info is mainly intended for printing and shouldn't affect how Polars handles anything, AFAICT.

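A quick pyarrow sketch of what I mean - the stored int64 instants are identical regardless of the timezone metadata:

```python
from datetime import datetime

import pyarrow as pa

naive = pa.array([datetime(2020, 1, 1)], type=pa.timestamp("us"))
utc = naive.cast(pa.timestamp("us", tz="UTC"))
# only the type metadata differs; the underlying integers are the same
assert naive.cast(pa.int64()).equals(utc.cast(pa.int64()))
```
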
Are you sure? As far as I can tell from the link you posted, it's part of the schema for a single

e.g. pyarrow allows a scalar to have a fixed offset:

```python
In [10]: pa.scalar(datetime(2012, 1, 1), type=pa.timestamp('s', tz='+01:00'))
Out[10]: <pyarrow.TimestampScalar: datetime.datetime(2012, 1, 1, 1, 0, tzinfo=pytz.FixedOffset(60))>
```

but it doesn't allow an array to have that as the dtype's timezone:

```python
In [13]: pc.assume_timezone(pa.array([datetime(2020, 1, 1)]), '+00:00')
---------------------------------------------------------------------------
ArrowInvalid: Cannot locate timezone '+00:00': +00:00 not found in timezone database
```

Not sure what you mean, sorry. Doesn't timezone info also affect daylight savings offsets and calendar durations (such as "1 day", which may be 23 hours, 24 hours, or 25 hours, depending on whether DST is being crossed)?

According to the Arrow format definitions, all timezone-aware timestamps are stored as "instants" (a 64-bit signed integer count of s/ms/µs/ns elapsed since the UNIX epoch) and can be assumed to all be

I wonder how PyArrow handles standalone scalars; as far as I can tell that's not something Arrow does. From what I'm reading, a "scalar timestamp" with no column metadata would just be an integer assumed to be a number of seconds since the UNIX epoch. Maybe it's just a one-row, one-column dataframe?

EDIT: It's apparently a wrapped integer scalar and timestamp metadata (unit and tz) stored together, with a bunch of timezone parsing being done just for the Python bindings so the timestamp can be returned as a Python datetime if needed. I wasn't quite right with my one-row column, but it's functionally the same thing. PyArrow's scalar function could therefore throw if you were to use a nonsensical timezone, but mainly because it needs to be able to return valid Python datetimes.

Thanks for looking into this

Sure, but it's true for

@MarcoGorelli: That might actually be an issue with PyArrow. The assume_timezone doc doesn't mention it, but the error message matches LocateZone, leading to here. LocateZone seemingly only tries to match against tzdb to validate the timezone, so it would fail when encountering time offsets. That looks like it's worth reporting.

Timestamps could be allowed to contain timezones unknown to either Polars or PyArrow, or even technically arbitrary strings, as part of the timezone descriptor, via some kind of "ignore the unknown timezone name and use only the UTC value" escape hatch, but communicating this properly to users would be tricky (because "ignoring the timezone" tends to mean you fall back to naïve, not to a stored UTC value), and a simple warning could lead to strange pitfalls when the date-time representation becomes important (mainly formatting and conversion to naïve date-times). Or I might just be overthinking this, given the initial issue was "+00:00 isn't recognized as UTC".

@MarcoGorelli: Should this be left open with regards to more general handling of timezone strings, or would it be preferable to move that to a separate issue, if that's even considered something Polars should want to handle? Or should we just recommend that users with any other timezone specifier cast their column to an IANA timezone like you suggested here, and not try to handle strange cases within the library?

do you mean, having fixed offsets in the datatype? that was intentionally removed some time ago: #8860

not having fixed offsets as datatypes has several advantages, see #9103 (comment)

do you have a specific use case where you need them?

More like having an escape hatch for otherwise-invalid (either fixed offsets or actually unknown string values) timezone specifiers. I don't really have a use case for those myself; I'm really more asking for the sake of robustness and not just panicking when encountering an unrecognized string.

I understand that columns can only have a single timezone (for fairly good reasons), which saves having to normalize these when constructing dataframes and having to guess what the original timezone was, but I don't believe that not supporting these to the point of panicking when a fixed offset is encountered is the best solution there is. For the "every timestamp in this column could have its own offset" situation, I could see having some sort of

@MarcoGorelli: I'm bringing this back up given you've pinged me on #12149 (comment), and that other issue also has to do with handling timezone offsets, where being able to keep track of how they vary on a per-row basis can be important. The other issue is about truncating timestamps on the DST fold and handling ambiguous values, where the entire column is set to an IANA timezone that respects some form of DST. It makes sense to perform truncation by going through naïve datetime conversion, but if "offset-only" timestamps are all normalized to UTC on load, they would all get truncated relative to UTC and the user would have no means of performing the operation while accounting for their original offset.

I'm considering raising this as a new issue instead, but I still think having some means of splitting and storing the original timezone specifier/offset of each row in a separate column for operations that may need it to be applied at some point, like rounding and formatting (and maybe allowing an "external timestamp offset column" to be specified for said operations), would prove useful when dealing with timestamps in row-based formats.

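A rough sketch of what I mean, with hypothetical column names (nothing polars offers out of the box today): keep the UTC instants plus each row's original offset, re-apply the offset for the local-frame operation, then shift back:

```python
from datetime import datetime

import polars as pl

df = pl.DataFrame({
    "ts_utc": [datetime(2023, 6, 26, 14, 15), datetime(2023, 6, 26, 23, 45)],
    "offset_minutes": [60, -300],  # each row's original fixed offset
})

df = df.with_columns(
    (
        (pl.col("ts_utc") + pl.duration(minutes=pl.col("offset_minutes")))
        .dt.truncate("1d")  # truncate in each row's original local frame
        - pl.duration(minutes=pl.col("offset_minutes"))  # shift back to UTC
    ).alias("ts_truncated_utc")
)
```
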
Hi! Not sure if I should open a new issue, as a lot of the discussion in this thread goes over my head, but I am having a problem with a similar issue: a pandas dataframe is being saved to parquet. It contains a column of type

Is this part of the same issue you were discussing here? Should I open a new one? Thanks

thanks, should be addressed by #13738 - but I'd really suggest, when you create your pandas dataframe, setting the time zone to the Area/Location one which your timestamps come from

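For example (with 'Europe/Amsterdam' standing in for wherever your timestamps actually come from):

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.date_range("2000", periods=3)})
# localize to an Area/Location zone rather than a fixed offset like '+01:00'
df["ts"] = df["ts"].dt.tz_localize("Europe/Amsterdam")
df.to_parquet("tmp.parquet")
```
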
Thank you very much for the suggestion, that is definitely the most correct way to go about things. It is the result of an extraction from a system through pandas and directly dumped to parquet, but I will see if I can get that change included there.

### Polars version checks

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of Polars.

### Issue description

I have a parquet file generated by an external program that Polars is having trouble with because of a timezone issue. I'm really not sure what's wrong with this file, because I believe Parquet stores its timestamps as int64 values and doesn't record any timezone information besides whether it's naïve or UTC. Polars can read it just fine, but if I try to work on the resulting dataframe, I get a

```
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: "'+00:00' is not a valid timezone"
```

error or something similar, depending on the operation.

Here's the file in question, `sample_weird_tz.parquet`, which has been reduced to one row and one column to better show the problem, alongside another parquet file containing an identical but working dataframe, generated by Polars, `sample_df_polars_identical.parquet`. That second file gets read just fine and shows that it's a problem specific to something about `sample_weird_tz`, not Parquet file reading as a whole.

### Reproducible example

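A sketch of the failing pattern (the exact operation varies - any datetime computation on the column triggers it):

```python
import polars as pl

df = pl.read_parquet("sample_weird_tz.parquet")  # reading itself succeeds
# working on the column then panics with "'+00:00' is not a valid timezone"
df.with_columns(pl.col("UTC_DATETIME_ID").dt.offset_by("1h"))
```
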
### Expected behavior

The specific error from my code example points to this line:

`polars/polars/polars-arrow/src/kernels/time.rs`, line 105 in 44c093c

I would certainly not expect Polars to panic when trying to manipulate the dataframe coming out of this file. Pandas (or rather fastparquet) seems to be able to read it just fine.

### Installed versions