
WIP tasks: aggregations #4

Merged: 1 commit into inveniosoftware:master on Jul 18, 2017

Conversation

@dinosk (Member) commented on Jul 7, 2017

  • ADDS aggregation classes for processing event data within specified time
    frames

  • ADDS templates for the monthly aggregation indices

Signed-off-by: Dinos Kousidis <konstantinos.kousidis@cern.ch>

@nharraud (Member) left a comment

It's a good start. I just added a few comments, most of which we already discussed on Friday, just as a memo.

@@ -0,0 +1,52 @@
{
"template": "monthly-file_download-*",
@nharraud:

Remove "monthly"; this is part of the suffix.
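For illustration, the template pattern would then match on the event name alone, with the monthly suffix handled when the concrete indices are created (a sketch of the suggested change, not the final template):

{
    "template": "file_download-*"
}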

CELERY_BEAT_SCHEDULE = {
    'indexer': {
        'task': 'invenio_stats.tasks.index_events',
        'schedule': timedelta(seconds=5),
@nharraud:

We should not schedule any processing more often than every 5 minutes.
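A minimal sketch of the adjusted schedule, keeping the same task and raising the interval to five minutes:

from datetime import timedelta

CELERY_BEAT_SCHEDULE = {
    'indexer': {
        'task': 'invenio_stats.tasks.index_events',
        # no processing more often than every five minutes
        'schedule': timedelta(minutes=5),
    },
}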

"type": "integer",
"index": "not_analyzed"
},
"bucket": {
@nharraud:

What is the bucket again? IMHO we should use something more explicit.

}
},
"aliases": {
"monthly-file_download": {}
@nharraud:

Same here, no need to specify "monthly".

@@ -19,7 +19,11 @@
"@timestamp": {
"type": "date"
},
"bucket": {
"bucket_id": {
@nharraud:

Same here regarding the bucket.

def get_bookmark(self):
    """Get last aggregation date."""
    if not Index(self.aggregation_alias,
                 using=self.client).exists():
@nharraud:

I am not sure I understand what you are trying to do here. The alias should always exist.
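If the alias is guaranteed to exist, the existence check can go away and only the missing-bookmark case needs handling. A minimal sketch, assuming self.client is a standard elasticsearch.Elasticsearch instance, the alias resolves to the bookmark index, and the 'bookmark' doc type and id are illustrative:

import datetime

from elasticsearch.exceptions import NotFoundError


def get_bookmark(self):
    """Get the last aggregation date."""
    try:
        doc = self.client.get(index=self.aggregation_alias,
                              doc_type='bookmark', id='bookmark')
        return doc['_source']['date']
    except NotFoundError:
        # No bookmark stored yet: fall back to today.
        return datetime.date.today().isoformat()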

bookmark = {
    'date': datetime.date.today().isoformat()
}
yield dict(_index='{0}-{1}-{2}'.
@nharraud:

When there is a yield of only one element (i.e. without a loop), it usually points to a design issue. Here, there is no need to use "bulk" if you are only sending one document to Elasticsearch. Bulk is for grouping operations, and it makes error handling more difficult as it returns the status for every element separately. There is another command which is easier for writing just one document.
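For a single document, the plain index API avoids bulk entirely and returns one status. A sketch, assuming self.client is an elasticsearch.Elasticsearch instance; the index name, doc type, and aggregation_doc variable are illustrative:

self.client.index(
    index='file_download-2017-07',         # illustrative monthly index
    doc_type='file-download-aggregation',  # illustrative doc type
    body=aggregation_doc,                  # the one document to write
)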

@nharraud:

Also, I don't see any error handling. One thing that should never happen is having day 1 fail but day 2 succeed: it would mean that we never reprocess day 1, as the latest bookmark is for day 2.
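One way to enforce that is to aggregate days strictly in order and move the bookmark only after each day succeeds, so a failure stops the run before later days are touched. A minimal sketch (all helper names are hypothetical):

def run(self):
    """Aggregate day by day, advancing the bookmark only on success."""
    for day in self.days_since_bookmark():  # hypothetical helper
        # If this raises, the loop stops and the bookmark still points
        # at the last successful day, so this day is retried next run.
        self.aggregate_day(day)             # hypothetical helper
        self.set_bookmark(day)              # hypothetical helper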

_index='{0}-{1}-{2}'.
    format(self.time_frame,
           self.event,
           date.strftime('%Y-%m')),
@nharraud:

No need to use time_frame; instead we can just let the user configure the '%Y-%m' part.
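That could be a constructor argument holding the strftime pattern for the index suffix, instead of a separate time_frame prefix (a sketch; the class and parameter names are illustrative):

class Aggregator(object):  # illustrative class name
    def __init__(self, event, index_interval='%Y-%m'):
        self.event = event
        # the user picks the granularity, e.g. '%Y-%m' or '%Y-%m-%d'
        self.index_interval = index_interval

    def index_name(self, date):
        return '{0}-{1}'.format(self.event,
                                date.strftime(self.index_interval))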

index=index_name)
self.query.aggs.bucket('by_{}'.format(self.event),
                       'terms', field='bucket',
                       size=999999)
@nharraud:

As discussed on Friday, we need to be careful here. We should be able to handle the case where thousands of files have been downloaded, and I don't think Elasticsearch will scale well with that many buckets.
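One option (assuming Elasticsearch 5.2 or later) is terms-aggregation partitioning, which walks the distinct values in bounded slices instead of requesting all buckets at once. A sketch with illustrative partition count and size:

from elasticsearch_dsl import Search

num_partitions = 20  # tune to the expected number of distinct files

for partition in range(num_partitions):
    search = Search(using=self.client, index=index_name)
    search.aggs.bucket(
        'by_{}'.format(self.event), 'terms', field='bucket',
        include={'partition': partition, 'num_partitions': num_partitions},
        size=10000,  # bounded per-partition ceiling instead of 999999
    )
    results = search.execute()
    # process results.aggregations['by_{}'.format(self.event)].buckets here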

                   for event in current_stats._events_config.values()]
event_templates = ['contrib/{0}'.format(event['event_type'])
                   for event in current_stats._events_config.values()]
aggregation_templates = ['contrib/aggregations']
@nharraud:

The aggregations should be added only if the corresponding events have also been added. We will probably have N events -> M aggregations, where M > N, because multiple aggregations can be deduced from the same event. Thus we could just have an _aggregations_config and aggregations in current_stats, which would work the same way as events. That way we let the user choose which aggregations they want.
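For illustration, such a configuration could mirror the existing events config, with an entry per aggregation pointing back at its source event (all keys and values here are hypothetical):

STATS_AGGREGATIONS = {
    'file-download-agg': {
        # registered only if the 'file-download' event itself is enabled
        'event_type': 'file-download',
        'aggregator': 'invenio_stats.aggregations:Aggregator',
        'interval': 'month',
    },
}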

@dinosk force-pushed the aggs branch 5 times, most recently from 39c02dd to ab8ab70 on July 17, 2017.
@nharraud merged commit cc44e88 into inveniosoftware:master on Jul 18, 2017.