__getstate__ of sliced string Series keeps reference to original Series values. #15246

Open
2 tasks done
dalejung opened this issue Mar 23, 2024 · 0 comments
Labels
bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import string
import random
import polars as pl


N = 100
s = pl.Series([
    f"|{x}|" + "".join(random.sample(string.ascii_letters, 20)) for x in range(N)
])

# Verify that end marker is in full serialized state
end_row_marker = f"|{N - 1}|".encode()
original_state = s.__getstate__()
print(f"{len(original_state)=}")
assert end_row_marker in original_state

# Take a slice of length 1
sliced = s.head(1)

# create an equivalent copy of the slice
good = pl.Series(sliced.to_list())
assert sliced.equals(good)

# validate the good case first.
good_state = good.__getstate__()
print(f"{len(good_state)=}")
# output state should only include marker for first row |0|
assert end_row_marker not in good_state

# The sliced state should be equivalent to good_state
sliced_state = sliced.__getstate__()
print(f"{len(sliced_state)=}")
# FAIL: still includes data for last row.
assert end_row_marker not in sliced_state

Log output

No response

Issue description

Basically re-opening #13972.

The issue is the same as the original: for certain string Series, pickling a slice still serializes values from the original Series.

The output of the script above shows that the sliced state is smaller than the original state, but still much larger than the sliced values require.

# OUTPUT:
# len(original_state)=4344
# len(good_state)=440
# len(sliced_state)=2808

If you look at sliced_state, you can see that the original N values still exist.

sliced_state
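
Rather than reading the raw bytes by eye, the leak can be checked by counting which row markers survive in a serialized state. This is a minimal sketch: `surviving_row_markers` is a hypothetical helper (not part of Polars), and it assumes the state is a `bytes` object containing the string data uncompressed, as the reproducible example already relies on.

```python
def surviving_row_markers(state: bytes, n_rows: int) -> list[int]:
    """Return the row indices whose `|i|` marker appears in `state`.

    Hypothetical diagnostic helper; assumes the `|i|` prefix convention
    used by the reproducible example.
    """
    return [i for i in range(n_rows) if f"|{i}|".encode() in state]


# Usage with the variables from the reproducible example (left commented
# out because the result depends on the Polars version):
# leaked = surviving_row_markers(sliced_state, N)
# print(f"{len(leaked)} of {N} row markers found in sliced_state")
```

For a one-row slice, the expected result is a list containing only row 0.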

Expected behavior

The sliced_state should be similar in size to good_state.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-6.8.1-arch1-1-x86_64-with-glibc2.39
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               0.9.2.post8+g4cb29ba
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               3.0.0.dev0+432.g5bcc7b7077
pyarrow:              16.0.0.dev339+g3a6c55a12.d20240320
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@dalejung dalejung added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 23, 2024
@dalejung dalejung changed the title Get state of sliced string Series keeps reference to original Series values. __getstate__ of sliced string Series keeps reference to original Series values. Mar 25, 2024