
WIP tasks: aggregations #4

Merged: 1 commit into inveniosoftware:master on Jul 18, 2017

Conversation

@dinosk (Member) commented on Jul 7, 2017

  • ADDS aggregation classes for processing event data within specified time
    frames

  • ADDS templates for the monthly aggregation indices

Signed-off-by: Dinos Kousidis <konstantinos.kousidis@cern.ch>

@nharraud (Member) left a comment

It's a good start. I just added a few comments, most of which we already discussed on Friday, just as a memo.

@@ -0,0 +1,52 @@
{
"template": "monthly-file_download-*",
@nharraud:

Remove "monthly"; this is part of the suffix.
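For illustration, the template pattern would then match on the event name alone, with the monthly suffix handled when the concrete indices are created (a sketch of the suggested change, not the final template):

{
    "template": "file_download-*"
}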

CELERY_BEAT_SCHEDULE = {
    'indexer': {
        'task': 'invenio_stats.tasks.index_events',
        'schedule': timedelta(seconds=5),
@nharraud:

We should not schedule any processing more often than every 5 minutes.
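A minimal sketch of the adjusted schedule, keeping the same task and raising the interval to five minutes:

from datetime import timedelta

CELERY_BEAT_SCHEDULE = {
    'indexer': {
        'task': 'invenio_stats.tasks.index_events',
        # no processing more often than every five minutes
        'schedule': timedelta(minutes=5),
    },
}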

"type": "integer",
"index": "not_analyzed"
},
"bucket": {
@nharraud:

What is the bucket again? IMHO we should use something more explicit.

}
},
"aliases": {
"monthly-file_download": {}
@nharraud:

Same here, no need to specify "monthly".

@@ -19,7 +19,11 @@
"@timestamp": {
"type": "date"
},
"bucket": {
"bucket_id": {
@nharraud:

Same here regarding the bucket.

def get_bookmark(self):
    """Get last aggregation date."""
    if not Index(self.aggregation_alias,
                 using=self.client).exists():
@nharraud:

I am not sure I understand what you are trying to do here. The alias should always exist.
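If the alias is guaranteed to exist, the existence check can go away and only the missing-bookmark case needs handling. A minimal sketch, assuming self.client is a standard elasticsearch.Elasticsearch instance, the alias resolves to the bookmark index, and the 'bookmark' doc type and id are illustrative:

import datetime

from elasticsearch.exceptions import NotFoundError


def get_bookmark(self):
    """Get the last aggregation date."""
    try:
        doc = self.client.get(index=self.aggregation_alias,
                              doc_type='bookmark', id='bookmark')
        return doc['_source']['date']
    except NotFoundError:
        # No bookmark stored yet: fall back to today.
        return datetime.date.today().isoformat()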

bookmark = {
    'date': datetime.date.today().isoformat()
}
yield dict(_index='{0}-{1}-{2}'.
@nharraud:

When there is a yield of only one element (i.e. without a loop), it usually points to a design issue. Here, there is no need to use "bulk" if you are only sending one document to Elasticsearch. Bulk is for grouping operations, and it makes error handling more difficult as it returns the status for every element separately. There is another command which is easier for writing just one document.
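For a single document, the plain index API avoids bulk entirely and returns one status. A sketch, assuming self.client is an elasticsearch.Elasticsearch instance; the index name, doc type, and aggregation_doc variable are illustrative:

self.client.index(
    index='file_download-2017-07',         # illustrative monthly index
    doc_type='file-download-aggregation',  # illustrative doc type
    body=aggregation_doc,                  # the one document to write
)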

@nharraud:

Also, I don't see any error handling. One thing that should never happen is having day 1 fail but day 2 succeed: it would mean that we never reprocess day 1, as the latest bookmark is for day 2.
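One way to enforce that is to aggregate days strictly in order and move the bookmark only after each day succeeds, so a failure stops the run before later days are touched. A minimal sketch (all helper names are hypothetical):

def run(self):
    """Aggregate day by day, advancing the bookmark only on success."""
    for day in self.days_since_bookmark():  # hypothetical helper
        # If this raises, the loop stops and the bookmark still points
        # at the last successful day, so this day is retried next run.
        self.aggregate_day(day)             # hypothetical helper
        self.set_bookmark(day)              # hypothetical helper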

_index='{0}-{1}-{2}'.
    format(self.time_frame,
           self.event,
           date.strftime('%Y-%m')),
@nharraud:

No need to use time_frame; instead we can just let the user configure the '%Y-%m' part.
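That could be a constructor argument holding the strftime pattern for the index suffix, instead of a separate time_frame prefix (a sketch; the class and parameter names are illustrative):

class Aggregator(object):  # illustrative class name
    def __init__(self, event, index_interval='%Y-%m'):
        self.event = event
        # the user picks the granularity, e.g. '%Y-%m' or '%Y-%m-%d'
        self.index_interval = index_interval

    def index_name(self, date):
        return '{0}-{1}'.format(self.event,
                                date.strftime(self.index_interval))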

index=index_name)
self.query.aggs.bucket('by_{}'.format(self.event),
                       'terms', field='bucket',
                       size=999999)
@nharraud:

As discussed on Friday, we need to be careful here. We should be able to handle the case where thousands of files have been downloaded, and I don't think Elasticsearch will scale well with that many buckets.
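One option (assuming Elasticsearch 5.2 or later) is terms-aggregation partitioning, which walks the distinct values in bounded slices instead of requesting all buckets at once. A sketch with illustrative partition count and size:

from elasticsearch_dsl import Search

num_partitions = 20  # tune to the expected number of distinct files

for partition in range(num_partitions):
    search = Search(using=self.client, index=index_name)
    search.aggs.bucket(
        'by_{}'.format(self.event), 'terms', field='bucket',
        include={'partition': partition, 'num_partitions': num_partitions},
        size=10000,  # bounded per-partition ceiling instead of 999999
    )
    results = search.execute()
    # process results.aggregations['by_{}'.format(self.event)].buckets here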

                   for event in current_stats._events_config.values()]
event_templates = ['contrib/{0}'.format(event['event_type'])
                   for event in current_stats._events_config.values()]
aggregation_templates = ['contrib/aggregations']
@nharraud:

The aggregations should be added only if the corresponding events have also been added. We will probably have N events -> M aggregations, where M > N, because multiple aggregations can be deduced from the same event. Thus we could just have an _aggregations_config and aggregations in current_stats, which would work the same way as events. That way we let the user choose which aggregations they want.
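For illustration, such a configuration could mirror the existing events config, with an entry per aggregation pointing back at its source event (all keys and values here are hypothetical):

STATS_AGGREGATIONS = {
    'file-download-agg': {
        # registered only if the 'file-download' event itself is enabled
        'event_type': 'file-download',
        'aggregator': 'invenio_stats.aggregations:Aggregator',
        'interval': 'month',
    },
}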

@dinosk force-pushed the aggs branch 5 times, most recently from 39c02dd to ab8ab70 on July 17, 2017.
@nharraud merged commit cc44e88 into inveniosoftware:master on Jul 18, 2017.