WIP tasks: aggregations #4
Conversation
It's a good start. I just added a few comments, most of which we already discussed on Friday; this is mostly a memo.
@@ -0,0 +1,52 @@
{
    "template": "monthly-file_download-*",
Remove "monthly"; this is part of the suffix.
invenio_stats/config.py
Outdated
CELERY_BEAT_SCHEDULE = {
    'indexer': {
        'task': 'invenio_stats.tasks.index_events',
        'schedule': timedelta(seconds=5),
We should not schedule any processing more frequently than every 5 minutes.
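The suggested change to the schedule above can be sketched as follows; the task path is taken from the diff, and the final name may differ:

```python
from datetime import timedelta

# Run the indexer at most every 5 minutes instead of every 5 seconds,
# as requested in the review. The task path follows the diff above.
CELERY_BEAT_SCHEDULE = {
    'indexer': {
        'task': 'invenio_stats.tasks.index_events',
        'schedule': timedelta(minutes=5),
    },
}

# The resulting interval in seconds:
interval = CELERY_BEAT_SCHEDULE['indexer']['schedule'].total_seconds()
```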
"type": "integer", | ||
"index": "not_analyzed" | ||
}, | ||
"bucket": { |
What is the bucket again? IMHO we should use something more explicit.
}
},
"aliases": {
    "monthly-file_download": {}
Same here: there is no need to specify "monthly".
@@ -19,7 +19,11 @@
"@timestamp": {
    "type": "date"
},
"bucket": {
"bucket_id": {
Same here regarding the bucket.
def get_bookmark(self):
    """Get last aggregation date."""
    if not Index(self.aggregation_alias,
                 using=self.client).exists():
I am not sure I understand what you are trying to do here. The alias should always exist.
invenio_stats/tasks.py
Outdated
bookmark = {
    'date': datetime.date.today().isoformat()
}
yield dict(_index='{0}-{1}-{2}'.
When there is a yield of only one element (i.e. without a loop), it usually indicates a design issue. Here, there is no need to use "bulk" if you are only sending one document to Elasticsearch. Bulk is for grouping operations, and it makes error handling more difficult because it returns the status for every element separately. There is another command which is easier for writing just one document.
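A minimal sketch of the single-document alternative: `Elasticsearch.index()` writes one document and raises on failure, so no bulk status parsing is needed. The index name, id convention, and the stand-in client below are illustrative, not part of the PR:

```python
class FakeES:
    """Minimal stand-in for elasticsearch.Elasticsearch, used only so
    this sketch runs without a live cluster."""

    def __init__(self):
        self.calls = []

    def index(self, **kwargs):
        # The real client raises an exception on failure, which makes
        # per-document error handling straightforward.
        self.calls.append(kwargs)
        return {'result': 'created'}


def write_bookmark(client, event, day):
    """Write a single aggregation bookmark with a plain index() call
    instead of the bulk helper (names here are hypothetical)."""
    return client.index(
        index='bookmarks',
        doc_type='{0}-bookmark'.format(event),
        id=day,
        body={'date': day},
    )


client = FakeES()
result = write_bookmark(client, 'file_download', '2017-06-30')
```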
Also, I don't see any error handling. One thing that should never happen is having day 1 fail but day 2 succeed: it would mean that we would never reprocess day 1, as the latest bookmark is for day 2.
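The invariant described above can be sketched as: process days in order and advance the bookmark only after a day succeeds, stopping at the first failure so the failed day is retried on the next run. The callables are hypothetical placeholders:

```python
def aggregate_days(days, run_aggregation, save_bookmark):
    """Process days in ascending order, advancing the bookmark only
    after each day succeeds.

    If day N fails, stop before touching day N+1, so the bookmark can
    never point past an unprocessed day.
    """
    for day in sorted(days):
        try:
            run_aggregation(day)
        except Exception:
            # Leave the bookmark at the last successful day and retry
            # from here on the next scheduled run.
            break
        save_bookmark(day)


# Usage sketch: day 2 fails, so the bookmark stays at day 1 and
# day 3 is not processed out of order.
processed, bookmarks = [], []

def run(day):
    if day == '2017-06-02':
        raise RuntimeError('aggregation failed')
    processed.append(day)

aggregate_days(['2017-06-01', '2017-06-02', '2017-06-03'],
               run, bookmarks.append)
```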
invenio_stats/tasks.py
Outdated
_index='{0}-{1}-{2}'.
format(self.time_frame,
       self.event,
       date.strftime('%Y-%m')),
No need to use time_frame; instead we can just let the user configure the '%Y-%m' part.
invenio_stats/tasks.py
Outdated
index=index_name)
self.query.aggs.bucket('by_{}'.format(self.event),
                       'terms', field='bucket',
                       size=999999)
As discussed on Friday, we need to be careful here. We should be able to handle the case where thousands of files have been downloaded, and I don't think that Elasticsearch will scale well with that many buckets.
invenio_stats/templates.py
Outdated
for event in current_stats._events_config.values()]
event_templates = ['contrib/{0}'.format(event['event_type'])
                   for event in current_stats._events_config.values()]
aggregation_templates = ['contrib/aggregations']
The aggregations should be added only if the corresponding events have also been added. We will probably have N events -> M aggregations, where M > N, because multiple aggregations can be derived from the same event. Thus we could just have an _aggregations_config and aggregations in current_stats, which would work the same way as events. That way we let the user choose which aggregations they want.
39c02dd to ab8ab70
ADDS aggregation classes for processing event data within specified time frames
ADDS templates for the monthly aggregation indices

Signed-off-by: Dinos Kousidis <konstantinos.kousidis@cern.ch>