summarise bug when the number is too big #126

Closed
antonio-yu opened this issue Jun 14, 2022 · 2 comments

antonio-yu commented Jun 14, 2022

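import pandas as pd
from datar.all import f, group_by, summarize, max_

data = pd.DataFrame([['xiaomi', 80000000000],
                     ['xiaomi', 90000000000],
                     ['xiaomi', 30000000000],
                     ['huawei', 20000000000],
                     ['huawei', 60000000000],
                     ['huawei', 70000000000],
                     ], columns=['brand', 'sale'])
# 'sale' is reassigned first, then read again by the later arguments
data >> group_by(f.brand) >> summarize(sale=f.sale.sum(), avg=f.sale.mean(), max=max_(f.sale))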


Output:
brand | sale | avg | max
<object> | <int64> | <float64> | <int64>
xiaomi | 200000000000 | 2.000000e+11 | 200000000000
huawei | 150000000000 | 1.500000e+11 | 150000000000

The columns 'sale', 'avg' and 'max' all return the same values when the numbers in 'sale' are very large. The bug seems to appear once the numbers exceed 1e+08.
On the other hand, 'avg' is displayed in scientific notation, while the other columns are displayed normally.

Owner

pwwang commented Jun 14, 2022

It's not a bug, and it has nothing to do with the size of the numbers. The later f.sale references refer to the sale column just created by the first argument of summarize, not to the original column in the input data frame.

See the same behavior in R:

r$> data                                                                                               
# A tibble: 6 × 2
         sale brand 
        <dbl> <chr> 
1 80000000000 xiaomi
2 90000000000 xiaomi
3 30000000000 xiaomi
4 20000000000 huawei
5 60000000000 huawei
6 70000000000 huawei

r$> data |> group_by(brand) |> summarise(sale = sum(sale), avg = mean(sale))                           
# A tibble: 2 × 3
  brand          sale          avg
  <chr>         <dbl>        <dbl>
1 huawei 150000000000 150000000000
2 xiaomi 200000000000 200000000000
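
The behavior can be imitated in plain pandas. This is only a rough sketch, not a description of how datar or dplyr are implemented; the two-step structure and the names step1 and step2 are illustrative assumptions. The values are deliberately small to show that magnitude plays no role.

```python
import pandas as pd

# Same shape as the data in the report, but with small numbers.
data = pd.DataFrame(
    [["xiaomi", 8], ["xiaomi", 9], ["xiaomi", 3],
     ["huawei", 2], ["huawei", 6], ["huawei", 7]],
    columns=["brand", "sale"],
)

# Step 1: sale = sum(sale). Each brand now has a single `sale` value.
step1 = data.groupby("brand", sort=False, as_index=False).agg(
    sale=("sale", "sum"),
)

# Step 2: later expressions read the NEW `sale`, so the mean and the max
# of that single value are just the value itself.
step2 = step1.groupby("brand", sort=False, as_index=False).agg(
    sale=("sale", "first"),
    avg=("sale", "mean"),
    max=("sale", "max"),
)

print(step2)
```

All three columns hold the per-brand sum, which matches the shape of the output in the report.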

To avoid this, either:

  1. use a different aggregation name:
data >> group_by(f.brand) >> summarize(sale_sum = f.sale.sum(), avg = f.sale.mean(), max = max_(f.sale)) 
#                                      ^^^^^^^^
  2. put the sale column last:
data >> group_by(f.brand) >> summarize(avg = f.sale.mean(), max = max_(f.sale), sale = f.sale.sum()) 
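
For comparison only, here is the same summary written with pandas named aggregation instead of datar. Each output column names its source column explicitly, so reusing the name sale for the result does not change what avg and max read from. This is a sketch of an alternative, not something suggested in the thread.

```python
import pandas as pd

data = pd.DataFrame(
    [["xiaomi", 80000000000], ["xiaomi", 90000000000], ["xiaomi", 30000000000],
     ["huawei", 20000000000], ["huawei", 60000000000], ["huawei", 70000000000]],
    columns=["brand", "sale"],
)

# Every (source column, function) pair is evaluated against the original data,
# so the order of the keyword arguments does not matter here.
result = data.groupby("brand", sort=False).agg(
    sale=("sale", "sum"),
    avg=("sale", "mean"),
    max=("sale", "max"),
)

print(result)
```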


pwwang commented Jun 14, 2022

As for the format of avg: it is shown in scientific notation because it is a float column, while the others are integers.

To control the format of float:

pd.options.display.float_format = '{:.2f}'.format
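
Setting pd.options.display.float_format changes the display for the whole session. If that is too broad, pandas also provides pd.option_context, which applies an option temporarily. A minimal sketch, where the summary frame is made-up illustration data rather than real datar output:

```python
import pandas as pd

# Illustration data shaped like the avg column from the report.
summary = pd.DataFrame({"brand": ["xiaomi", "huawei"], "avg": [2.0e11, 1.5e11]})

# The float format applies only inside this block; the previous setting
# is restored when the block exits.
with pd.option_context("display.float_format", "{:.2f}".format):
    formatted = summary.to_string()

print(formatted)
```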

@pwwang pwwang closed this as completed Jun 14, 2022