Ingest download.o.o Apache access logs and generate metrics.
The basic flow:
- stream log file
- run through `ingest.php`, which creates summary JSON files
This process runs multiple log ingests concurrently and will wait for all sub-processes to complete.
The cached data is then:
- aggregated by intervals (day, week, month)
- written to InfluxDB
A separate, minimal set of aggregations is done for each IP protocol's data.
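The concurrent ingest step above can be sketched in Python. This is a simplified stand-in: the real tool drives `ingest.php` sub-processes, and the per-line parsing here only counts requests per day.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def ingest(log_path):
    """Stand-in for ingest.php: summarize a single Apache access log.

    Only requests per day are counted here; the real parser extracts much more.
    """
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            # Apache combined log format: the timestamp follows the first "[",
            # e.g. "12/Mar/2017:10:00:00 +0000" -> day "12/Mar/2017".
            day = line.split("[", 1)[1][:11]
            counts[day] += 1
    return dict(counts)

def ingest_all(log_paths):
    # The real tool forks one ingest.php sub-process per log file and waits
    # for all of them; a worker pool gives the same wait-for-all structure.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(ingest, log_paths))
    merged = Counter()
    for summary in summaries:
        merged.update(summary)
    return dict(merged)
```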
- `aggregate.php`: invoked to manage the entire process, including calling `ingest.php`
- `ingest.php`: used to parse a single log file / stream and dump summary JSON to stdout
- `~/.cache/openSUSE-release-tools/metrics-access`: cache data separated by IP protocol; a single JSON file corresponds to a single access log file
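The cache layout described above can be pictured as a simple path mapping. The directory structure and file naming here are a hypothetical illustration; only the cache root and the one-JSON-per-log, per-protocol separation come from the text.

```python
import os

CACHE_DIR = os.path.expanduser("~/.cache/openSUSE-release-tools/metrics-access")

def summary_path(log_name, protocol):
    # One summary JSON per access log, separated by IP protocol.
    # The exact sub-directory layout is an assumption for illustration.
    return os.path.join(CACHE_DIR, protocol, log_name + ".json")
```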
## Future product versions
openSUSE-style product versions are parsed by `ingest.php` and included in the summary JSON files. Any request path to either the main product repositories, or to any repository seemingly built against a product, is included. There are many bogus products found on OBS, like `openSUSE_Leap_42.22222` and such, which are filtered out during the aggregation step. This allows the products included in the final output to be independent of the parse-time determination. By filtering valid products last, new product patterns may be added after access to those products has begun and been parsed.
- `ingest.php` for the generalized product path detection
- `aggregate.php` for the final product filter (note only the version number is included)
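The late filtering step can be illustrated with a small sketch. The whitelist pattern below is hypothetical (covering Leap 42.x / 15.x style versions); the real pattern is maintained in `aggregate.php`.

```python
import re

# Hypothetical whitelist of plausible version numbers; the real pattern
# lives in aggregate.php and is updated as new products appear.
VALID_VERSION = re.compile(r"^(42\.[123]|15\.\d)$")

def filter_products(product_counts):
    """Drop parse-time noise such as openSUSE_Leap_42.22222, keeping only
    entries whose version number matches the whitelist."""
    return {version: count for version, count in product_counts.items()
            if VALID_VERSION.match(version)}
```

Because the filter runs at aggregation time, re-running `aggregate.php` with an extended pattern retroactively includes products that were already parsed into the cached summaries.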
A possible improvement would be to automate the update of this pattern based on information in OBS.
A product-specific annotation may be added to the Grafana dashboard by duplicating the query used for the other products, assuming a schedule was added to the
## Factory vs Tumbleweed
Since many repositories that build against Tumbleweed are still named `openSUSE_Factory`, and the transition between the names was not done automatically, it is not fully possible to determine which "product" was the target. As such, all Tumbleweed-equivalent names are merged and counted under `Tumbleweed`, including main repository access. This could be extended to show some sort of conversion from the `openSUSE_Factory` name to `Tumbleweed`, but the primary goal was to show total users.
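The merge can be pictured as a simple name normalization. The exact set of merged names below is an assumption for illustration; the source only states that Tumbleweed-equivalent names, including `openSUSE_Factory`, are counted together.

```python
# Hypothetical set of rolling-release repository names folded into a single
# "Tumbleweed" product, including the legacy openSUSE_Factory name.
ROLLING_NAMES = {"openSUSE_Factory", "openSUSE_Tumbleweed", "Tumbleweed"}

def normalize_product(name):
    # Versioned products (e.g. "42.2") pass through unchanged.
    return "Tumbleweed" if name in ROLLING_NAMES else name
```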
Given that the archival log data is located on a different network from the active data, the tool must be run from a machine with access to both, or in two steps. Once the summary data has been generated, access to the original log files is no longer necessary.
Existing tools (like telegraf) were evaluated, but found to be far too slow to process the more than 20TB of raw access log data. PHP was selected since it runs around an order of magnitude faster than Python at simply opening a log file and running a "startswith" test against each line. Adding additional logic widens the performance gap significantly. All told, ~500,000 entries/second is achieved on each core of a development laptop, compared to less than 1,000 entries/second processed by telegraf.
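The benchmarked workload is essentially this hot loop, shown here in Python for illustration (the production implementation is PHP):

```python
def count_prefix(log_path, prefix):
    # Stream the log and run a "startswith" test against each line -- the
    # minimal per-line work used to compare per-language throughput.
    matches = 0
    with open(log_path) as log:
        for line in log:
            if line.startswith(prefix):
                matches += 1
    return matches
```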
The original run on `tortuga.suse.de`, using 7 cores, took roughly 23 hours to process 22TB of data into 12GB of summary data. This data takes up less than 6MB in InfluxDB once aggregated.