different results between 0.19.8 and 0.20.31 #16937

Closed
2 tasks done
Tmonster opened this issue Jun 13, 2024 · 2 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@Tmonster

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

python script for the bug

#!/usr/bin/env python3

import polars as pl
from polars import col

src_grp = "G1_1e7_1e2_5_0.csv"

with pl.StringCache():
    x = (pl.read_csv(src_grp, dtypes={"id4":pl.Int32, "id5":pl.Int32, "id6":pl.Int32, "v1":pl.Int32, "v2":pl.Int32, "v3":pl.Float64}, low_memory=True)
         .with_columns(pl.col(["id1", "id2", "id3"]).cast(pl.Categorical)))

ans = x.group_by(["id1","id2","id3","id4","id5","id6"]).agg([pl.sum("v3").alias("v3"), pl.count("v1").alias("count")])
chk = ans.lazy().select([pl.col("v3").sum(), pl.col("count").cast(pl.Int64).sum()]).collect()
print(f"chk = {chk}")

compare results with 2 different polars versions

# get the dataset 
wget https://duckdb-blobs.s3.amazonaws.com/data/G1_1e7_1e2_5_0.csv
# setup virtual environments
python3 -m venv polars-0.19.8
python3 -m venv polars-0.20.31
# run on 0.19.8
source polars-0.19.8/bin/activate
python3 -m pip install polars==0.19.8
python3 -m pip install numpy
python3 bug.py
# chk = shape: (1, 2)
# ┌──────────┬──────────┐
# │ v3       ┆ count    │
# │ ---      ┆ ---      │
# │ f64      ┆ i64      │
# ╞══════════╪══════════╡
# │ 4.7497e8 ┆ 10000000 │
# └──────────┴──────────┘
deactivate
# run on 0.20.31
source polars-0.20.31/bin/activate
python3 -m pip install polars==0.20.31
python3 -m pip install numpy
python3 bug.py
# chk = shape: (1, 2)
# ┌──────────┬─────────┐
# │ v3       ┆ count   │
# │ ---      ┆ ---     │
# │ f64      ┆ i64     │
# ╞══════════╪═════════╡
# │ 4.7497e8 ┆ 9500000 │
# └──────────┴─────────┘

Log output

No response

Issue description

The count result is not consistent between versions: the summed count is 10000000 on 0.19.8 but 9500000 on 0.20.31, while the v3 sum is the same on both.

Expected behavior

The output is the same on both versions.

Installed versions

--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             macOS-13.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr  9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
--------Version info---------
Polars:              0.19.8
Index type:          UInt32
Platform:            macOS-13.4.1-arm64-arm-64bit
Python:              3.12.3 (main, Apr  9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               1.26.4
openpyxl:            <not installed>
pandas:              <not installed>
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>
@mcrumiller
Contributor

pl.count() changed between 0.19.8 and 0.20.31: it no longer counts null values, which is why it now returns a smaller number.

0.19

Count the number of values in this column/context.
Warning
null is deemed a value in this context.

0.20 (stable)

Return the number of non-null values in the column.
...
Calling this function without any arguments returns the number of rows in the context. This way of using the function is deprecated. Please use len() instead.

@Tmonster
Author

Ah ok, yea that's probably the issue. I'll close this then
