summarise bug when the number is too big #126

antonio-yu · 2022-06-14T06:50:38Z

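import pandas as pd
from datar.all import f, group_by, summarize, max_

# Six rows, two brands; every sale value is well above 1e+08
data = pd.DataFrame(
    [
        ['xiaomi', 80000000000],
        ['xiaomi', 90000000000],
        ['xiaomi', 30000000000],
        ['huawei', 20000000000],
        ['huawei', 60000000000],
        ['huawei', 70000000000],
    ],
    columns=['brand', 'sale'],
)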
data >> group_by(f.brand) >> summarize(sale=f.sale.sum(), avg=f.sale.mean(), max=max_(f.sale))


Output:
brand | sale | avg | max
<object> | <int64> | <float64> | <int64>
xiaomi | 200000000000 | 2.000000e+11 | 200000000000
huawei | 150000000000 | 1.500000e+11 | 150000000000

All three aggregations (sale, avg and max) return the same value when the values in the 'sale' column are large. The bug seems to appear once the numbers are bigger than 1e+08.
On the other hand, avg is displayed in scientific notation, while the other columns are displayed normally.


pwwang · 2022-06-14T15:54:57Z

It's not a bug. The later uses of f.sale (in avg and max) refer to the sale column that summarize has just created from the first item, not to the original column in the original data frame. The size of the numbers is not involved.

See the same behavior in R:

r$> data                                                                                               
# A tibble: 6 × 2
         sale brand 
        <dbl> <chr> 
1 80000000000 xiaomi
2 90000000000 xiaomi
3 30000000000 xiaomi
4 20000000000 huawei
5 60000000000 huawei
6 70000000000 huawei

r$> data |> group_by(brand) |> summarise(sale = sum(sale), avg = mean(sale))                           
# A tibble: 2 × 3
  brand          sale          avg
  <chr>         <dbl>        <dbl>
1 huawei 150000000000 150000000000
2 xiaomi 200000000000 200000000000
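The evaluation order described above can be emulated in plain Python. This is only a rough sketch of the idea, not how datar or dplyr is implemented: `summarize_sequential` is a hypothetical helper that evaluates the summaries left to right and lets each result shadow a column of the same name.

```python
from statistics import mean


def summarize_sequential(columns, **exprs):
    """Evaluate summaries left to right; each result is visible to later ones.

    Hypothetical helper for illustration only (not part of datar).
    """
    env = dict(columns)  # working copy of one group's columns
    out = {}
    for name, fn in exprs.items():
        value = fn(env)
        out[name] = value
        env[name] = [value]  # a same-named column is now shadowed by the summary
    return out


# The three 'xiaomi' rows from the example data
xiaomi = {"sale": [80000000000, 90000000000, 30000000000]}

# 'sale' is overwritten first, so avg and max only see the sum
shadowed = summarize_sequential(
    xiaomi,
    sale=lambda c: sum(c["sale"]),
    avg=lambda c: mean(c["sale"]),
    max=lambda c: max(c["sale"]),
)

# With a different name for the sum, avg and max see the original column
renamed = summarize_sequential(
    xiaomi,
    sale_sum=lambda c: sum(c["sale"]),
    avg=lambda c: mean(c["sale"]),
    max=lambda c: max(c["sale"]),
)

print(shadowed)
print(renamed)
```

In the first call every summary ends up equal to the sum, which matches the output reported in the issue; in the second call the mean and maximum are computed from the three original values.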

To avoid this, either:

use a different aggregation name:

data >> group_by(f.brand) >> summarize(sale_sum = f.sale.sum(), avg = f.sale.mean(), max = max_(f.sale)) 
#                                      ^^^^^^^^

or put the sale column last, so that the earlier aggregations still see the original column:

data >> group_by(f.brand) >> summarize(avg = f.sale.mean(), max = max_(f.sale), sale = f.sale.sum())
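To cross-check what either workaround should return, the same aggregations can be computed with plain pandas, where every aggregation always reads the original column. This is a pandas-only sketch for comparison, assuming the example data from the issue; it does not use datar.

```python
import pandas as pd

data = pd.DataFrame(
    [
        ["xiaomi", 80000000000],
        ["xiaomi", 90000000000],
        ["xiaomi", 30000000000],
        ["huawei", 20000000000],
        ["huawei", 60000000000],
        ["huawei", 70000000000],
    ],
    columns=["brand", "sale"],
)

# Named aggregation: each output column is computed from the original 'sale'
expected = data.groupby("brand", sort=False).agg(
    sale_sum=("sale", "sum"),
    avg=("sale", "mean"),
    max=("sale", "max"),
)

print(expected)
```

Here the sum, mean and maximum differ per brand, which is what the renamed or reordered summarize call is expected to produce as well.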

pwwang · 2022-06-14T15:55:59Z

For the format of avg, it's because it is a float and others are integers.

To control the format of float:

pd.options.display.float_format = '{:.2f}'.format
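If the global option is too broad, the same formatter can be applied only inside a block with `pd.option_context`. A small sketch, assuming a data frame with a single float column named `avg` as in the output above:

```python
import pandas as pd

df = pd.DataFrame({"avg": [2.0e11, 1.5e11]})

# The float format applies only inside the with-block
with pd.option_context("display.float_format", "{:.2f}".format):
    rendered = df.to_string()

print(rendered)
```

Outside the `with` block the display option returns to its previous value, so other output is not affected.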

pwwang added and then removed the bug label on Jun 14, 2022

pwwang closed this as completed Jun 14, 2022
