
Backfilling guidelines

mortengrouleff committed Jul 3, 2019
1 parent fc92171 commit 8e45ff2b5d0ce945380d0d304f231570b7feea41
@@ -130,7 +130,7 @@ desirable. But raising it reduces the number of small segments
generated, since segments get flushed after at most that amount of
time. Reducing the 30 minute interval to, say, 5 minutes would make the
fail-over happen much faster, but at the cost of normal operation, as
there would be 12 * 24 = 288 segments in each datasource per day,
compared to the 2 * 24 = 48 with the current value. The cost of having
all these extra segments would slow down normal operation somewhat.

@@ -467,7 +467,7 @@ only. The new repository files will get copied at that point in time.

A datasource is ultimately bounded by the volume that one CPU thread can manage
to compress and write to the filesystem. This is typically in the 1-4 TB/day range.
To handle more ingest traffic from a specific datasource, you need to provide more
variability in the set of tags. But in some cases it may not be possible or desirable to adjust
the set of tags or tagged fields in the client. To solve this case, Humio supports
adding a synthetic tag that is assigned a random number for each (small) bulk of events.
@@ -176,8 +176,8 @@ This request contains three events. The first two are tagged with `server1` and

Tags are key-value pairs.

Events are stored in datasources. A repository has a set of datasources.
Datasources are defined by their tags. An event is stored in a datasource
matching its tags. If no datasource with the exact tags exists, it is created.
Tags are used to optimize searches by filtering out unwanted events.
At least one tag must be specified.
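
As a rough sketch of this mapping (the tag names and values below are hypothetical, chosen only for illustration), each distinct combination of tags corresponds to its own datasource:

```
# Hypothetical illustration of how tag combinations map to datasources.
- tags: { host: "server1", source: "application.log" }   # -> datasource A
- tags: { host: "server1", source: "application.log" }   # -> same tags, same datasource A
- tags: { host: "server2", source: "application.log" }   # -> new tag combination, datasource B
```
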
@@ -40,7 +40,7 @@ Source is limited to approximately 5 MB/s on average. The exact
amount depends on how much a single CPU core can "Digest". If a Data
Source receives more data for a while, then Humio turns on "auto
sharding". This adds a synthetic tag value for the tag
"#humioAutoShard" to split the stream into multiple data sources. This
"#humioAutoShard" to split the stream into multiple datasources. This
process is fully managed by Humio.
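
As a conceptual sketch only (the value shown is made up; Humio assigns it itself and senders never set this tag), an auto-sharded event effectively carries an extra tag such as:

```
# Illustration only: the synthetic tag Humio adds when auto sharding is active.
tags:
  "#humioAutoShard": "3"   # random value chosen per (small) bulk of events
```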

For optimal performance a Data Source should receive 1 KB/s or more on
@@ -10,7 +10,7 @@ These events are designed with GDPR requirements in mind and come in two variant
The purpose of the separation into these two groups is to make the audit trail trustworthy by making the sensitive actions immutable through Humio.

The sensitive kind includes assigning roles to users on repositories, changing retention settings on repositories,
deleting repositories and datasources, and similar actions. See the list of all logged events below.

Sensitive events are tagged with `#sensitive="true"`, non-sensitive as `#sensitive="false"`.

@@ -239,7 +239,7 @@ completely blocks access to the audit log.
| `canEditIngestListeners`| Allow creating and updating ingest listeners |
| `canDeleteEvents`| Allow deleting events |
| `canEditRetention`| Allow editing retention on a repository |
| `canDeleteDatasources`| Allow deleting datasources |
| `canDeleteDataspace`| Allow deletion of repositories and views |
| `canChangeDeleteEventsPermission`| Special permission needed to assign the permissions (deleteEvent, deleteDatasource, deleteDataspace and editRetention). |
| `canEditSearchSettings`| Allow editing the default search query and time interval |
@@ -65,7 +65,9 @@ output:
{{< partial "common-rest-params" >}}

{{% notice note %}}
To optimize performance for the data volumes you want to send, and to keep shipping latency down, change the default settings for `compression_level`, `worker`, `bulk_max_size` and `flush_interval`.
Don't raise `bulk_max_size` much: 100 - 300 is the appropriate range. While raising it may increase ingest throughput, it has a negative impact on the search performance of the resulting events in Humio.

{{% /notice %}}
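
A minimal sketch of where these settings sit in a filebeat configuration, following the configuration examples elsewhere in this commit (the values are illustrative; `flush_interval` is not shown, since its exact placement depends on the filebeat version):

```
output:
  elasticsearch:
    compression_level: 5
    worker: 1
    bulk_max_size: 200   # keep within the 100 - 300 range noted above
```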

## Adding fields
@@ -78,6 +80,9 @@ fields:
  datacenter: dc-a
```

Fields can be turned into tags by including a `@tags` field that lists
the names of fields to turn into tags. This applies to fields both
from the `fields` section and from the events being shipped. Refer to [datasources]({{< relref "concepts/datasources" >}}) for information on tags.
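
For example, extending the `fields` snippet above, a sketch of turning the `datacenter` field into a tag via `@tags` (whether this particular field makes a good tag depends on your data):

```
fields:
  datacenter: dc-a
  "@tags": ["datacenter"]
```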

### Ingesting to multiple repos using a single ingest token

@@ -74,6 +74,8 @@ output:
    bulk_max_size: 200
    worker: 1
    # Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
    # While doing so may increase throughput of ingest it has a negative
    # impact on search performance of the resulting events in Humio.
```

{{% notice note %}}
@@ -96,7 +98,7 @@ You must make the following changes to the sample configuration:
* Specify the text encoding to use when reading files using the `encoding` field.
If the log files use special, non-ASCII characters, then set the encoding here. For example, `utf-8` or `latin1`.

* If all your events are fairly small, you can increase `bulk_max_size` from the default of 200 to 300. The default of 200 is fine for most use cases.
The Humio server does not limit the size of the ingest request.
But keep `bulk_max_size` low, as requests may time out if they get too large. In case of timeouts, filebeat will back off, resulting in worse performance than with a lower `bulk_max_size`.
(Note! The Humio cloud on cloud.humio.com does limit requests to 32 MB. If you go above this limit, you will get "Failed to perform any bulk index operations: 413 Request Entity Too Large"
@@ -201,7 +203,7 @@ See [the section on tags]({{< ref "tagging.md" >}}) for more information about t
If a `type` is configured in Filebeat, it's always used as a tag. Other fields can be used
as tags as well by defining the fields as `tagFields` in the
[parser]({{< relref "parsers/_index.md" >}}) pointed to by the `type`.
In Humio, tags always start with `#`. When a field is turned into a tag, its name will
be prepended with `#`.


@@ -12,10 +12,10 @@ Archiving works by running a periodic job inside all Humio nodes which looks for
An admin user needs to set up archiving per repository. After selecting a repository on the Humio front page, the configuration page is available under Settings.

{{% notice info %}}
For slow-moving datasources it can take some time before segment files are completed on disk and then made available for the archiving job. In the worst case, a segment file must either contain a gigabyte of uncompressed data or 7 days must pass before it is completed. This limitation will be removed in a future version of Humio.
{{% /notice %}}

More on [segment files]({{< relref "concepts/ingest-flow" >}}) and [datasources]({{< relref "concepts/datasources" >}}).

## S3 Layout

@@ -174,7 +174,7 @@ queries to improve the readability of your queries.

However, these pipe characters are not mandatory. The Humio query engine can
recognize tag filters, and use this
information to narrow down the number of datasources to search.
This feature decreases query time.

See the [tags documentation]({{< ref "tagging.md" >}}) for more on tags.
@@ -24,6 +24,14 @@ the [getting started: sending application logs]({{< ref "getting-started-applica

Below is an overview of how the respective flows of sending data to Humio work:

{{% notice note %}}
Humio is optimized for live streaming of events in real time. If you
ship data that is *not* live, you need to observe some basic rules in order
for the resulting events in Humio to be stored as efficiently as if they
had been live. See [Backfilling](#backfilling-guidelines) below.
{{% /notice %}}


## Data Shippers {#data-shippers}

A Data Shipper is a small system tool that looks at files and system properties
@@ -100,3 +108,79 @@ graph LR;

As you can see, this is by far the simplest flow, and is completely appropriate
for some scenarios, e.g. analytics.

## Sending historical data (Backfilling events) {#backfilling-guidelines}

The other sections were mostly concerned with "live" events that you
want shipped to Humio as soon as they exist. But perhaps you also have
files with events from the last month on disk and want them sent to
Humio, along with the "live" events?

There are some rules you should observe for optimal
performance of the resulting data inside Humio, and so that backfilling
does not interfere with the live events already flowing, if any.

* Make sure to ship the historical events in order by their timestamp,
or at least very close to this ordering. A few seconds out of order have
little consequence, whereas hours or days are a problem.

* If there are also live events flowing into Humio then make sure the
historical events get an extra tag to separate them from the live
events. This makes them a separate stream that does not overlap the live ones.

* If shipping data in parallel (e.g. running multiple filebeat
instances), then make sure those streams are distinguishable to Humio by
using a distinct tag for each stream. This makes them separate
streams that do not overlap the other historical streams.

If those guidelines are not followed, the result is likely to be an
increase in the number of segment files and much higher IO usage
when searching time spans that overlap the historical events
or the live events that were ingested while the backfill was
active. The segment files are likely to get large, overlapping time
spans, leading to a large number of files being searched even when
searching a short time interval.

In short: Don't ship events into one datasource with timestamps (much) out of order.

### Example: Using filebeat to ship historical events

As an example, let's say you have one 10 GB log file for each
day in the last month. You want to send all of them in parallel into
Humio, and there is already a stream of live events flowing. In this
case you should run one instance of the desired shipper
(e.g. filebeat) for each file. Each shipper needs a configuration file
that sets a distinct tag. In this example, let's use the filename being
backfilled as the tag value. For filebeat this can be accomplished by
making the `@source` field that is set by filebeat a tag in the parser
in Humio. Or better yet, you can add or extend the `fields` section in
the config:

```
filebeat:
  inputs:
  - paths:
      - /var/log/need-backfilling/myapp.2019-06-17.log
    # section that adds the backfill tag:
    fields:
      "@backfill": "2019-06-17"
      "@tags": ["@backfill", "@type"]

queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s

output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
    # Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
    # While doing so may increase throughput of ingest it has a negative
    # impact on search performance of the resulting events in Humio.
```
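
To backfill the next day's file in parallel, a second filebeat instance would use the same configuration with only the input path and the backfill tag value changed (a sketch following the pattern above; the path and date are illustrative):

```
filebeat:
  inputs:
  - paths:
      - /var/log/need-backfilling/myapp.2019-06-18.log
    fields:
      "@backfill": "2019-06-18"
      "@tags": ["@backfill", "@type"]
# The queue.mem and output sections stay the same as in the config above.
```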
