LazyFrame drop complains other existing column is not present incorrectly #12722

Closed
2 tasks done
DarkAmoeba opened this issue Nov 27, 2023 · 4 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@DarkAmoeba

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import datetime
import polars as pl

df = pl.DataFrame([{'sarGateID': 12029,
  'ntpTime': datetime.datetime(2023, 10, 23, 12, 35, 23, 769956),
  'serviceVersion': '17.1.2',
  'serviceTime': datetime.datetime(2023, 10, 23, 12, 35, 23, 769956),
  'lab': 'tduc',
  'hostname': 'vc-fs06-tduc',
  'payload': {'GUFI': '00000000-0000-4000-A000-000000008999',
   'Flight_Id': 9001,
   'Operation_Type': 'ADDITION'},
  'version': '17.1.2',
  'year': 2023,
  'ymd': 20231023}]).lazy()

df2 = df.select([pl.col('ntpTime'), 
                 pl.col('serviceTime'),
                 pl.col('hostname'), 
                 pl.col('payload').struct.field('GUFI'), 
                 pl.col('payload').struct.field('Flight_Id'), 
                 pl.col('payload').struct.field('Operation_Type'), 
                ])

(df2
 .drop('Operation_Type')
).fetch()

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[171], line 27
      4 df = pl.DataFrame([{'sarGateID': 12029,
      5   'ntpTime': datetime.datetime(2023, 10, 23, 12, 35, 23, 769956),
      6   'serviceVersion': '17.1.2',
   (...)
     14   'year': 2023,
     15   'ymd': 20231023}]).lazy()
     17 df2 = df.select([pl.col('ntpTime'), 
     18                  pl.col('serviceTime'),
     19                  pl.col('hostname'), 
   (...)
     22                  pl.col('payload').struct.field('Operation_Type'), 
     23                 ])
     25 (df2
     26  .drop('Operation_Type')
---> 27 ).fetch()

File /usr/local/CONDA/conda_toast-3.11.01/lib/python3.11/site-packages/polars/utils/deprecation.py:95, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     90 @wraps(function)
     91 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     92     _rename_keyword_argument(
     93         old_name, new_name, kwargs, function.__name__, version
     94     )
---> 95     return function(*args, **kwargs)

File /usr/local/CONDA/conda_toast-3.11.01/lib/python3.11/site-packages/polars/lazyframe/frame.py:2283, in LazyFrame.fetch(self, n_rows, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, streaming)
   2270     comm_subexpr_elim = False
   2272 lf = self._ldf.optimization_toggle(
   2273     type_coercion,
   2274     predicate_pushdown,
   (...)
   2281     eager=False,
   2282 )
-> 2283 return wrap_df(lf.fetch(n_rows))

ComputeError: column 'GUFI' not available in schema Schema:
name: sarGateID, data type: Int64
name: ntpTime, data type: Datetime(Microseconds, None)
name: serviceVersion, data type: Utf8
name: serviceTime, data type: Datetime(Microseconds, None)
name: lab, data type: Utf8
name: hostname, data type: Utf8
name: payload, data type: Struct([Field { name: "GUFI", dtype: Utf8 }, Field { name: "Flight_Id", dtype: Int64 }, Field { name: "Operation_Type", dtype: Utf8 }])
name: version, data type: Utf8
name: year, data type: Int64
name: ymd, data type: Int64

Issue description

This looks like the same error as the one reported in #1644. If it is the same issue, it was already fixed in an earlier build.

Expected behavior

This works if I remove the .lazy() and the .fetch() calls.

Installed versions

--------Version info---------
Polars:              0.19.3
Index type:          UInt32
Platform:            Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
Python:              3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              2023.9.1
gevent:              23.9.0.post1
matplotlib:          3.8.0
numpy:               1.26.0
pandas:              2.1.0
pyarrow:             13.0.0
pydantic:            2.3.0
sqlalchemy:          2.0.21
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

@DarkAmoeba DarkAmoeba added bug Something isn't working python Related to Python Polars labels Nov 27, 2023
@cmdlineluser
Contributor

cmdlineluser commented Nov 27, 2023

Just to confirm, this also happens on 0.19.17.

It seems to trigger when .struct is used.

It appears to be using the original schema from df instead of the schema of df2:

df = pl.select(A = pl.struct(B = pl.lit('C'))).with_row_count()

df2 = df.lazy().select(pl.col('A').struct['B'])

print(df2.schema)
# OrderedDict([('B', Utf8)])

df2.drop('aaa').collect()
# ComputeError: column 'B' not available in schema Schema:
# name: row_nr, data type: UInt32
# name: A, data type: Struct([Field { name: "B", dtype: Utf8 }])
#
# ^^^ this is `df.schema` 

It seems to be related to projection_pushdown, using the original example:

(df2.drop('Operation_Type')
    .collect(projection_pushdown=False)
)

# shape: (1, 5)
# ┌────────────────────────────┬────────────────────────────┬──────────────┬───────────────────────────────────┬───────────┐
# │ ntpTime                    ┆ serviceTime                ┆ hostname     ┆ GUFI                              ┆ Flight_Id │
# │ ---                        ┆ ---                        ┆ ---          ┆ ---                               ┆ ---       │
# │ datetime[μs]               ┆ datetime[μs]               ┆ str          ┆ str                               ┆ i64       │
# ╞════════════════════════════╪════════════════════════════╪══════════════╪═══════════════════════════════════╪═══════════╡
# │ 2023-10-23 12:35:23.769956 ┆ 2023-10-23 12:35:23.769956 ┆ vc-fs06-tduc ┆ 00000000-0000-4000-A000-00000000… ┆ 9001      │
# └────────────────────────────┴────────────────────────────┴──────────────┴───────────────────────────────────┴───────────┘

@kszlim
Contributor

kszlim commented Dec 8, 2023

I got a similar issue when using pl.from_numpy (utilizing the schema option) and then doing a to_struct afterwards.

When I collect the resulting df, the struct fields exist, but if I select a struct field in a lazy context and then collect afterwards, it doesn't seem to work.

@8uurg

8uurg commented Jan 9, 2024

As someone who encountered a similar issue, I've managed to simplify it down to:

example = pl.DataFrame({"f": {"a": [0, 1, 2]}})

(example.lazy()
  .select(pl.col("f").struct.field("a"))
  .select(pl.col("a"))
).collect()

Results in:

ComputeError: column 'a' not available in schema Schema:
name: f, data type: Struct([Field { name: "a", dtype: Int64 }])

A very simple solution I've found is to explicitly alias the extracted field:

example = pl.DataFrame({"f": {"a": [0, 1, 2]}})

(example.lazy()
  .select(pl.col("f").struct.field("a").alias("a"))
  .select(pl.col("a"))
).collect()

This works as expected.

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@henryharbeck
Contributor

Hi @stinodego, I believe this issue can be closed. None of the three reproducible examples raise anymore.

@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Jun 3, 2024