
The future? #3

Open
martindurant opened this issue Dec 8, 2022 · 11 comments

Comments

@martindurant
Member

This repo currently is a nice little proof-of-concept.

To be a viable package in the pydata realm, it should have

  • decent test coverage and CI
  • releases on PyPI/conda
  • a move to a recognised GitHub org (intake?)
  • full documentation

Should we poke pandas directly to consider upstreaming?

@martindurant
Member Author

@jreback I thought you would be just the kind of person who might have an opinion on this little project.

@jreback

jreback commented Dec 16, 2022

Yeah, I think we would seriously consider even vendoring this.

But starting as an optional dependency is great.

@martindurant
Member Author

@jrbourbeau - MattR suggested this would be of interest given your ongoing work with Spark-generated data types. Maybe you have already solved everything with Arrow's decimal type?

@jrbourbeau

With pandas 1.5+ we can now use DataFrames with pyarrow-backed decimal128 data. Here's a small snippet that demonstrates this in action:

import decimal
import pandas as pd
import dask.dataframe as dd
import pyarrow as pa

pa_dtype = pa.decimal128(precision=7, scale=3)

data = pa.array(
    [
        decimal.Decimal("8093.012"),
        decimal.Decimal("8094.123"),
        decimal.Decimal("8095.234"),
        decimal.Decimal("8096.345"),
        decimal.Decimal("8097.456"),
        decimal.Decimal("8098.567"),
    ],
    type=pa_dtype,
)
df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
ddf = dd.from_pandas(df, npartitions=3)

# Some pandas operation
print(f"{df.x * 3 - 1 = }")
# Equivalent Dask operation
print(f"{(ddf.x * 3 - 1).compute() = }")

Experimenting locally, that approach seems to work well for decimal data (though I've not done exhaustive testing). Does that align well with pandas-decimal?

@jamescooke

With pandas>=2 and pyarrow>=12, there seems to be quite a bit of mileage possible with pd.ArrowDtype(pa.decimal128(...)), as mentioned by @jrbourbeau above.

Should this be wrapped in pandas-decimal, or has progress been made elsewhere? Or maybe some other solution has arisen?

@martindurant
Member Author

Since most things should be possible with the arrow extension type alone, there has been no further development or attention here. I daresay I can archive this. The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring arrow. That ship has probably gone.

@jamescooke

I daresay I can archive this.

Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README? 🙏🏻
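As a partial answer to my own question: the most basic approach I know of (exact, but unvectorised, and needing no pyarrow at all) is an object-dtype column of plain decimal.Decimal values - a minimal sketch:

```python
import decimal
import pandas as pd

# An object-dtype Series of decimal.Decimal values: exact arithmetic,
# but every operation falls back to slow Python-level loops
s = pd.Series([decimal.Decimal("0.1")] * 3)
print(s.sum())   # prints 0.3 -- the value is exactly Decimal('0.3')
print(0.1 * 3)   # prints 0.30000000000000004 with binary floats
```

This is exact where binary floats are not, which is the whole point, but it gives up all the speed of columnar storage.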

The only advantage I see (aside from the technical differences) is the ability to have exact decimals in pandas without requiring arrow. That ship has probably gone.

Agree the ship has sailed - that's a pretty niche use case. Functions like pd.read_gbq() require pandas-gbq, which installs pyarrow, so it's already in the Python environment for many.

@martindurant
Member Author

Agree - but - the PyPI page is on the first page of search results for me for "pandas decimal", and I've not found any good write-up of how to "do Decimal in pd DataFrames" - do you have any tips / resources? Maybe they could be linked from the README?

This is a good point. Maybe I should write up a blog post demonstrating decimals in pandas with pyarrow. I can publish it on Anaconda and link it in the README as you suggest, but it will take me a little while to get to it.

@jamescooke

A blog with some more guidance about decimals with PyArrow would be great.

Meanwhile, I've been digging into this a little more at work to see if we should use decimal128 for a new part of our pipeline. I agree that decimal128 provides vectorization and fixed-point arithmetic - but there are two problems that I've not found a fix for.

Describe has problems

In [1]: import decimal
   ...: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa_dtype = pa.decimal128(precision=7, scale=3)

In [4]: data = pa.array(
   ...:     [
   ...:         decimal.Decimal("8093.012"),
   ...:         decimal.Decimal("8094.123"),
   ...:         decimal.Decimal("8095.234"),
   ...:         decimal.Decimal("8096.345"),
   ...:         decimal.Decimal("8097.456"),
   ...:         decimal.Decimal("8098.567"),
   ...:     ],
   ...:     type=pa_dtype,
   ...: )
   ...: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa_dtype))
                                                    
In [5]: df.describe()
---------------------------------------------------------------------------                              
ArrowInvalid                              Traceback (most recent call last)
...
ArrowTypeError: int or Decimal object expected, got numpy.int64

Simple ops have problems

Continuing with the df above - adding 1 to column 'x' is fine:

In [7]: df['x'] + 1                       
Out[7]: 
0    8094.012
1    8095.123
2    8096.234
3    8097.345
4    8098.456
5    8099.567
Name: x, dtype: decimal128(23, 3)[pyarrow]

However, if we change the precision and scale to use more of the available bits, things get hairy:

In [13]: df = pd.DataFrame({"x": data}, dtype=pd.ArrowDtype(pa.decimal128(precision=38, scale=3)))

In [14]: df['x'] + 1                                                                                                                                                                                               
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
...
ArrowInvalid: Decimal precision out of range [1, 38]: 39

We know from above, where df['x'] + 1 worked and returned values as decimal128(23, 3)[pyarrow], that the result of adding 1 will fit in decimal128(precision=38, scale=3) (right? correct me if I'm wrong 🙏🏻). But there appears to be a problem where the precision and scale of the result type are not being managed correctly.

Current conclusion

My guess is that there are workarounds for the issues above - plus possibly even fixes upstream in pandas. However, this extra complexity for vectorization and a smaller memory footprint might not be worth it for us at this time. My plan is to push ahead with decimal.Decimal in the DataFrame and see what walls we collide with 🙏🏻 .

@martindurant
Member Author

I must admit that I had not tried to do much with arrow decimals. Your notes here are pretty disappointing! If you use dtype="decimal[3]" from this package, describe() at least does work. Also, we are significantly faster:

In [23]: df = pd.DataFrame({"a": [0, 1] * 100000}, dtype=pd.ArrowDtype(pa.decimal128(precision=7, scale=3)))

In [26]: %timeit df.a.std()
1.28 ms ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [27]: df = pd.DataFrame({"a": [0, 1] * 100000}, dtype="decimal[3]")

In [28]: %timeit df.a.std()
340 µs ± 2.62 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

As for precision, you would perhaps expect to be bounded by integers up to 2**63 (~19 decimal digits for numbers near 1). I don't know why arrow has both precision and scale...
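For reference, the int64 bound does work out to 19 digits:

```python
# The largest signed 64-bit integer, and its decimal digit count
print(2**63 - 1)            # 9223372036854775807
print(len(str(2**63 - 1)))  # 19
```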

In conclusion, decimal.Decimal may solve your problem, but it is orders of magnitude slower.

In [35]: df["a"] = df.a.map(lambda x: decimal.Decimal(str(x)))

In [39]: %timeit df.a.mean()  # this is MEAN because std fails
12.9 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jamescooke

Regarding pyarrow being required by pandas, I found this just now: pandas-dev/pandas#52509

4 participants