BUG/CLN: Clean float / complex string formatting #36799

Merged (26 commits) on Oct 14, 2020

Conversation

dsaxton
Member

@dsaxton dsaxton commented Oct 2, 2020

Noticed while working on another bug. The _is_number helper here is wrong and can cause incorrect results given that this code path is hit by arbitrary strings (e.g., it thinks "foo" is a number). Also the _trim_zeros_complex helper apparently does nothing:

[ins] In [3]: _trim_zeros_float(["0.00000"])
Out[3]: ['0.0']

[ins] In [4]: _trim_zeros_complex(["1.000+1.000000j"])
Out[4]: ['1.000+1.000000j']
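To make the "foo" problem concrete, here is a minimal sketch comparing the old predicate with a regex-based one. The names `is_number_old` and `is_number_new` and the default `na_rep="NaN"` are assumptions for illustration only, and `re.escape` is swapped in for the hand-written backslash so that any decimal character is treated literally.

```python
import re


def is_number_old(x: str, na_rep: str = "NaN") -> bool:
    # Old check: anything that is not the NA marker and does not end in "inf" passes.
    return x != na_rep and not x.endswith("inf")


def is_number_new(x: str, decimal: str = ".") -> bool:
    # Regex check: optional whitespace, optional minus sign, digits,
    # then an optional decimal mark followed by more digits.
    pattern = rf"\s*-?[0-9]+({re.escape(decimal)}[0-9]*)?"
    return re.match(pattern, x) is not None


for value in ["foo", " -1.50", "NaN", "inf"]:
    print(repr(value), is_number_old(value), is_number_new(value))
```

Note that `re.match` only anchors at the start of the string, so a value such as "1abc" would still pass the regex check.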

@simonjayhawkins
Member

Also the _trim_zeros_complex helper apparently does nothing:

something is wrong then.

_trim_zeros_complex was 'fixed' in #25745, why is test_to_string_complex_float_formatting not failing? has there been some other refactor since?

@dsaxton
Member Author

dsaxton commented Oct 2, 2020

_trim_zeros_complex was 'fixed' in #25745, why is test_to_string_complex_float_formatting not failing? has there been some other refactor since?

My guess is it's passing because there are no zeros to trim in the test case.

Instead of trimming zeros it's as though zeros are added; I would expect the formatting behavior instead to be similar to float for both the real and imaginary parts (below is master):

[ins] In [4]: s = pd.Series([0.000, 1.000])
         ...: print(s)
         ...:
         ...: s = pd.Series([0.000+1.000j, 1.000+1.000j])
         ...: print(s)
         ...:
0    0.0
1    1.0
dtype: float64
0    0.000000+1.000000j
1    1.000000+1.000000j
dtype: complex128

Should I open an issue about this, and either close this PR or turn it into a bug fix?

@dsaxton dsaxton changed the title CLN: Clean float / complex string formatting BUG/CLN: Clean float / complex string formatting Oct 2, 2020
Contributor

@jreback jreback left a comment

Can you run some asvs? I'm not sure this is really hit much, but maybe.

 ) -> List[str]:
     """
     Trims zeros, leaving just one before the decimal points if need be.
     """
     trimmed = str_floats

     def _is_number(x):
-        return x != na_rep and not x.endswith("inf")
+        return re.match(fr"\s*-?[0-9]+(\{decimal}[0-9]*)?", x) is not None
Contributor

can you compile this and put it on the class / variable

@jreback jreback added the Output-Formatting __repr__ of pandas objects, to_string label Oct 2, 2020
@dsaxton
Member Author

dsaxton commented Oct 4, 2020

Can you run some asvs? I'm not sure this is really hit much, but maybe.

       before           after         ratio
     [a975a754]       [229b5fc1]
     <master>         <fix-is-numeric-helper>
+      1.84±0.1ms       2.37±0.2ms     1.28  io.csv.ReadCSVCachedParseDates.time_read_csv_cached(False)
+     4.16±0.09ms       4.86±0.5ms     1.17  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
-     6.88±0.07ms      6.18±0.09ms     0.90  io.hdf.HDFStoreDataFrame.time_store_info

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

-----

       before           after         ratio
     [a975a754]       [229b5fc1]
     <master>         <fix-is-numeric-helper>
-     6.88±0.03ms      6.07±0.04ms     0.88  io.hdf.HDFStoreDataFrame.time_store_info
-        24.4±1ms       16.2±0.2ms     0.66  io.hdf.HDFStoreDataFrame.time_read_store_table_wide

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

-----

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

-----

       before           after         ratio
     [a975a754]       [229b5fc1]
     <master>         <fix-is-numeric-helper>
+      18.5±0.4ms       24.4±0.7ms     1.32  io.csv.ReadCSVThousands.time_thousands(',', None)
+      17.7±0.8ms       22.4±0.6ms     1.26  io.csv.ReadCSVThousands.time_thousands(',', ',')
+      17.9±0.6ms         21.8±1ms     1.22  io.csv.ReadCSVThousands.time_thousands('|', ',')
-     6.94±0.08ms      6.14±0.04ms     0.89  io.hdf.HDFStoreDataFrame.time_store_info
-     2.09±0.08ms      1.78±0.03ms     0.85  io.csv.ReadCSVFloatPrecision.time_read_csv(';', '_', 'high')
-        5.63±1ms      3.95±0.03ms     0.70  io.csv.ReadCSVFloatPrecision.time_read_csv_python_engine(',', '.', 'high')
-      23.6±0.7ms       16.1±0.6ms     0.68  io.hdf.HDFStoreDataFrame.time_read_store_table_wide

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

-----

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Ran the IO asvs a few times and results aren't consistent. Maybe there's something going on with io.csv / io.hdf but hard to say.

Comment on lines 1880 to 1889
max_length = max(lengths)
padded = [
    s[: -((k - 1) // 2 + 1)]  # real part
    + (max_length - k) // 2 * "0"
    + s[-((k - 1) // 2 + 1) : -((k - 1) // 2)]  # + / -
    + s[-((k - 1) // 2) : -1]  # imaginary part
    + (max_length - k) // 2 * "0"
    + s[-1]
    for s, k in zip(complex_strings, lengths)
]
Member

Would it be safer to split the real and imaginary parts on the + / - sign and then process the integer and fractional parts by splitting on the dot? That way you would not need to rely on the symmetry of the original string.

Member Author

You mean trim zeros after splitting into fractional and non-fractional parts? I think the trimming has to be done with the decimal there. (I realize this helper is very confusing, and there's likely a better way to do this.)

Member

Right, I mean trimming zeros after splitting into fractional and non-fractional parts. Since a dot always separates the two parts of a float string, there is no risk of introducing a bug IMHO (even if there is no dot at all).

Member Author

@dsaxton dsaxton Oct 8, 2020

I think then you have to keep track of which parts are fractional and then only trim those?

However this part is not doing any actual trimming, it's correcting for the fact that the previous function is now trimming "too much." (It trims the real and imaginary parts of each complex number independently, so they aren't aligned afterwards. Rather than rewrite the other function I found it easier to do this post-processing.)
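For comparison, the split-based alternative discussed in this thread might look roughly like the sketch below. It is not the merged implementation: the function name `pad_complex_strings`, the regex, and the assumptions that the decimal mark is ".", that there are no exponents, and that no nan/inf entries appear are all hypothetical simplifications.

```python
import re
from typing import List

# Assumed input shape: "<real><sign><imag>j" with "." as the decimal mark.
_COMPLEX_PARTS = re.compile(r"^(\s*[+-]?[0-9.]+)([+-])([0-9.]+)j$")


def pad_complex_strings(strings: List[str]) -> List[str]:
    if not strings:
        return []

    parts = []
    for s in strings:
        match = _COMPLEX_PARTS.match(s)
        if match is None:
            raise ValueError(f"unsupported complex string: {s!r}")
        parts.append(match.groups())

    # Widest fractional part across every real and imaginary component.
    width = max(
        len(component.partition(".")[2])
        for real, _, imag in parts
        for component in (real, imag)
    )

    def pad(component: str) -> str:
        # Right-pad the fractional digits with zeros up to the shared width.
        if width == 0:
            return component
        whole, _, frac = component.partition(".")
        return f"{whole}.{frac.ljust(width, '0')}"

    return [pad(real) + sign + pad(imag) + "j" for real, sign, imag in parts]


print(pad_complex_strings(["1.0+1.0j", "1.05+1.0j"]))
```

Splitting on the sign removes the dependence on string symmetry, at the cost of having to parse each component explicitly.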

@jreback jreback added this to the 1.2 milestone Oct 10, 2020
@jreback
Contributor

jreback commented Oct 10, 2020

lgtm can you add a whatsnew note and ping on green.

@dsaxton
Member Author

dsaxton commented Oct 11, 2020

lgtm can you add a whatsnew note and ping on green.

@jreback Added note + green



def test_to_string_complex_number_trims_zeros():
    s = pd.Series([1.000000 + 1.000000j, 1.0 + 1.0j, 1.05 + 1.0j])
Contributor

why should these have 2 decimal zeros and not 1 like ordinary floats? Where did you get the expected output from?

Member Author

There's a 1.05 in the last element, so the expected output is similar to what happens for floats:

[ins] In [2]: s = pd.Series([1.0, 1.0000, 1.05])

[ins] In [3]: s
Out[3]: 
0    1.00
1    1.00
2    1.05
dtype: float64
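The float behaviour above follows from trimming a trailing zero only while every entry in the column still ends in one. A minimal sketch of that idea, assuming all inputs are already fixed-width numeric strings containing the decimal mark (the name `trim_shared_zeros` is hypothetical and this is not the pandas helper itself):

```python
from typing import List


def trim_shared_zeros(strings: List[str], decimal: str = ".") -> List[str]:
    trimmed = list(strings)
    # Only trim when every entry has a decimal mark, so integers like "10" are left alone.
    if not trimmed or not all(decimal in s for s in trimmed):
        return trimmed
    # Drop one trailing zero from every entry for as long as all entries end in "0".
    while all(s.endswith("0") for s in trimmed):
        trimmed = [s[:-1] for s in trimmed]
    # Leave one zero after a bare decimal mark, e.g. "0." becomes "0.0".
    return [s + "0" if s.endswith(decimal) else s for s in trimmed]


print(trim_shared_zeros(["1.000000", "1.000000", "1.050000"]))
print(trim_shared_zeros(["0.00000"]))
```

Because 1.05 needs two decimals, trimming stops there for the whole column, which is why the other entries keep two decimals as well.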

@jreback jreback merged commit 44ab7a1 into pandas-dev:master Oct 14, 2020
@jreback
Contributor

jreback commented Oct 14, 2020

thought we had merged this, thanks @dsaxton

@dsaxton dsaxton deleted the fix-is-numeric-helper branch October 14, 2020 15:06
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Labels
Output-Formatting __repr__ of pandas objects, to_string
5 participants