Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation on annotated field fails if queryset ordered by other values + set default value #5

Closed
PJCampi opened this issue Apr 29, 2018 · 4 comments

Comments

@PJCampi
Copy link

PJCampi commented Apr 29, 2018

Hey Marts,

First of all thanks for the package, it simple but very useful :)

I have a couple of comments/questions:

1. Aggregation on annotated field
I have noticed that you do not reorder the QuerySet before aggregation, which causes problems when aggregating on an annotated field, if it is already ordered by another field.

For example, consider the model:

class Blog(models.Model):
    date = models.DateField()
    category = models.CharField(max_length=50)
    content = models.TextField(default='')
    
    class Meta:
        order_by = ['date']

If I pivot on an annotated field like:

qs = Blog.objects.annotate(period=Trunc('date', 'year'))
pivot(qs, 'period', 'category', 'content', Count)

Aggregation does not work because the 'date' field must be part of the query's group by for the ordering to work.

Reordering fixes the problem, so this will work:

qs = Blog.objects.annotate(period=Trunc('date', 'year')).order_by('period')
pivot(qs, 'period', 'category', 'content', Count)

I think it would make sense to implement this in the package.

2. Default value
The default column value is None atm. Why not using 0? Alternatively could one set the default value at module level? Something like:

def _get_annotations(column, column_values, data, aggregation):
    value = data if hasattr(data, 'resolve_expression') else F(data)
    return {
        display_value: aggregation(Case(When(Q(**{column: column_value}), then=value), default=DEFAULT_VALUE))
        for column_value, display_value in column_values
    }

3. Cumulative aggregation
It would be fairly easy to allow for cumulative aggregation as well, which can be useful. It can be done using the window function (+distinct):

window = Window(aggregator(data), order_by=row)

It does not work on sqlite though...
I'd be happy to have a look at it if you think it could be useful.

@martsberger
Copy link
Owner

Thanks for your feedback, these are good comments!

#1 Aggregating annotated field and order_by

If it works to remove the ordering altogether, I could implement that, .e.g., if

qs = Blog.objects.annotate(period=Trunc('date', 'year')).order_by()
pivot(qs, 'period', 'category', 'content', Count)

fixes the issue, I could make the pivot function remove the ordering. I think this should work as the result of pivoting should be independent of order_by.

If a specific ordering is required in this case it's a bit trickier, because adding an ordering that isn't necessary in other cases would degrade performance in those cases.

#2 Default value
There is a difference between None and 0 that is meaningful. If I find a bunch of records to sum up and the sum is zero, I should return the value 0. If I look for records to sum up and find no records at all, this is different. If I report 0 in the second case I'm indicating I found records that sum to zero, while if I return None I'm indicating no records were found.

Granted, in some cases people prefer to have the number 0 even in the second case I described above. There are two available ways to achieve this

  • Convert None to 0 after you get the result from pivot
  • Use an aggregation that converts Null to 0 in the database (this might not be the greatest implementation):
from django.db.models import Count
class CountDefaultZero(Count):
    template = 'COALESCE({}, 0)'.format(Count.template)

pivot(qs, 'period', 'category', 'content', CountDefaultZero)

Possibly allowing a default passed into the pivot function would be nicer:

pivot(qs, 'period', 'category', 'content', Count, default=0)

or maybe it unnecessarily clutters the api. I'll give it some thought.

#3 Cumulative aggregation
Does MySQL support the window function or just postgres? I'm okay with having db specific extensions, just need to think about the right way to do it. Is it possible to pass in an aggregation that does this?

If you have an implementation in mind, I'd be interested in seeing it.

@PJCampi
Copy link
Author

PJCampi commented May 2, 2018

#1: order_by() works and it is better than using an arbitrary ordering.

#2: that's what I figured, and that's what Excel pivot does by default too (I think).

It's a valid argument for Sum, less so for e.g. Count. In practice, like in your shirt sales example, if a shirt of a particular style has not been sold, it will be represented with no row present in an inventory table. The total income for that style should ideally return 0eur.

3# I think that most SQL languages support the OVER (PARTITION BY x ORDER BY y) clause which is used by the Window function (see here for MySql). I have not tested it though, I use Postgres.

You cannot pass it as an aggregation because you need to call distinct() afterwards.
Basically instead of:

qs.annotate(**{column: aggregator(data)})

you need to call:

qs.annotate(**{column: Window(aggregator(data), order_by: F(row).asc())}).distinct()

It also requires some kind of support for passing ordering on the row field. Let me know if you want something more concrete.

@martsberger
Copy link
Owner

Opened #9 to address point 2 above.

@martsberger
Copy link
Owner

I opened #31 for cumulative aggregation. With that, I'm going to close this issue as all the other parts have been resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants