diff --git a/examples/distributed-loki/1-prerequisites/config.yaml b/examples/distributed-loki/1-prerequisites/config.yaml
index d023e68..643e3b8 100644
--- a/examples/distributed-loki/1-prerequisites/config.yaml
+++ b/examples/distributed-loki/1-prerequisites/config.yaml
@@ -16,6 +16,8 @@ data:
       http_listen_port: 3100
       http_server_read_timeout: 1m
       http_server_write_timeout: 1m
+      grpc_server_max_recv_msg_size: 8388608
+      grpc_server_max_send_msg_size: 8388608
       log_level: error
     chunk_store_config:
       max_look_back_period: 0s
@@ -33,6 +35,7 @@ data:
     frontend_worker:
       frontend_address: loki-distributed-query-frontend:9095
     ingester:
+      max_chunk_age: 2h
       chunk_block_size: 262144
       chunk_encoding: snappy
       chunk_idle_period: 30m
@@ -49,6 +52,7 @@ data:
       join_members:
       - loki-distributed-memberlist
     query_range:
+      parallelise_shardable_queries: true
       align_queries_with_step: true
       cache_results: true
       max_retries: 5
@@ -58,6 +62,8 @@ data:
           fifocache:
             max_size_bytes: 500MB
             validity: 24h
+    query_scheduler:
+      max_outstanding_requests_per_tenant: 2048
     ruler:
       alertmanager_url: https://alertmanager.xx
       external_url: https://alertmanager.xx
@@ -104,7 +110,7 @@ data:
       max_line_size_truncate: false
       max_entries_limit_per_query: 10000
       max_streams_per_user: 0
-      max_global_streams_per_user: 0
+      max_global_streams_per_user: 25000
       unordered_writes: true
       max_chunks_per_query: 2000000
       max_query_length: 721h
diff --git a/loki_config.md b/loki_config.md
new file mode 100644
index 0000000..d608197
--- /dev/null
+++ b/loki_config.md
@@ -0,0 +1,144 @@
+# Custom Loki Configuration for NetObserv
+
+Grafana Loki is configured in a YAML file which contains information on the Loki server and its individual components.
+
+Some of these settings need to be tweaked for network observability, according to your cluster size, number of flows and sampling.
+
+## How to update configs
+
+Use the following commands to easily update your Loki config.
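Before tuning, it can help to have a rough idea of the log volume Loki must ingest. The sketch below estimates it from cluster size, flow rate and sampling; all numbers are illustrative assumptions, not NetObserv defaults:

```shell
# Rough sizing sketch: estimated ingest rate from cluster size, flow rate
# and sampling. Every value here is an assumption to adapt to your cluster.
NODES=50                       # nodes in the cluster
FLOWS_PER_NODE_PER_SEC=1000    # flows emitted per node per second
SAMPLING=50                    # sampling ratio (1 = keep all flows)
AVG_FLOW_BYTES=400             # average size of one enriched flow record
BYTES_PER_SEC=$(( NODES * FLOWS_PER_NODE_PER_SEC * AVG_FLOW_BYTES / SAMPLING ))
echo "$(( BYTES_PER_SEC / 1024 )) KiB/s"   # → 390 KiB/s with these numbers
```

The result gives a ballpark for the ingestion and message-size limits discussed in this document.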
+
+### Using Zero Click
+
+Update [zero-click-loki/2-loki.yaml](./examples/zero-click-loki/2-loki.yaml) or [zero-click-loki/2-loki-tls.yaml](./examples/zero-click-loki/2-loki-tls.yaml) with your custom configuration.
+
+Then replace the related config using:
+```bash
+oc replace --force -f config.yaml
+```
+
+The pod will restart automatically.
+
+### Using Zero Click / Distributed Loki
+
+Update [distributed-loki/1-prerequisites/config.yaml](./examples/distributed-loki/1-prerequisites/config.yaml) with your custom configuration.
+
+Then replace the config using:
+```bash
+oc replace --force -f config.yaml
+```
+
+Restart all pods of the `loki` instance:
+```bash
+oc delete pods --selector app.kubernetes.io/instance=loki -n netobserv
+```
+
+### Using Loki Operator
+
+The LokiStack management state needs to be set to `Unmanaged` first to allow configmap updates.
+
+Run the following command to get the `lokistack-config` configmap in the `netobserv` namespace:
+```bash
+oc get configmap lokistack-config -n netobserv -o yaml | yq '.binaryData | map_values(. | @base64d)' > binaryData.txt
+```
+
+Update the binaryData.txt file accordingly.
+
+Then run the following command to update the `lokistack-config` configmap in the `netobserv` namespace using the updated file:
+```bash
+BINARY_CONFIG=$(yq -o=json -I=0 'map_values(. | @base64)' binaryData.txt) && echo $BINARY_CONFIG
+oc patch configmap lokistack-config -n netobserv -p '{"binaryData":'$BINARY_CONFIG'}'
+```
+
+Restart all pods of the `lokistack` instance:
+```bash
+oc delete pods --selector app.kubernetes.io/name=lokistack -n netobserv
+```
+
+## Wide time range queries - too many outstanding requests
+
+> The query frontend splits larger queries into multiple smaller queries, executing these queries in parallel on downstream queriers and stitching the results back together again. This prevents large (multi-day, etc) queries from causing out of memory issues in a single querier and helps to execute them faster.
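The fan-out described above can be sketched with a quick calculation. Assuming the frontend splits by a 30-minute interval (the `split_queries_by_interval` setting; the value here is an assumption), a single wide query enqueues many sub-queries that all count against the scheduler queue:

```shell
# Sketch: how many sub-queries one wide query can enqueue, assuming a
# 30m split interval (split_queries_by_interval; value is an assumption).
QUERY_RANGE_HOURS=24
SPLIT_INTERVAL_MIN=30
echo "$(( QUERY_RANGE_HOURS * 60 / SPLIT_INTERVAL_MIN )) sub-queries"   # → 48 sub-queries
```

A handful of such queries running concurrently can quickly exceed the scheduler's per-tenant queue.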
+
+Check the [Grafana official documentation](https://grafana.com/docs/loki/latest/fundamentals/architecture/components/#splitting) for details.
+
+Some queries may be limited by the query scheduler. You will need to update the following configuration:
+
+```yaml
+  query_range:
+    parallelise_shardable_queries: true
+  query_scheduler:
+    max_outstanding_requests_per_tenant: 100
+```
+
+Ensure `parallelise_shardable_queries` is set to `true` and increase `max_outstanding_requests_per_tenant` according to your needs (default = 100). It's reasonable to set a high value here, such as `2048`; however, it will decrease query performance when multiple users query concurrently.
+
+Check the [query_scheduler](https://grafana.com/docs/loki/latest/configuration/#query_scheduler) configuration for more details.
+
+## Bulk messages - gRPC received message larger than max
+
+The bulk messages containing batches of records, received by the Loki distributor and exchanged between components, have a maximum size in bytes set by the following parameters:
+
+```yaml
+  server:
+    grpc_server_max_recv_msg_size: 4194304
+    grpc_server_max_send_msg_size: 4194304
+```
+
+By default the size is `4194304` bytes = `4MiB`. It's reasonable to increase it to `8388608` bytes = `8MiB`.
+
+## Delay - Entry too far behind for stream
+
+While flows are collected and enriched, latency builds up between the current time and the record timestamps. This particularly applies when using Kafka on large clusters.
+
+Loki can be configured to reject old samples using the following configuration:
+
+```yaml
+  limits_config:
+    reject_old_samples_max_age: 168h
+```
+
+On top of that, Loki writes logs in chunks ordered by time. If a received entry is older than the most recent one in its stream, it is considered `out-of-order`.
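For instance, with a `168h` maximum age, whether an entry is rejected as too old can be sketched as follows (the epoch timestamps are arbitrary example values):

```shell
# Sketch: would a record be rejected by reject_old_samples_max_age=168h?
# Epoch timestamps below are arbitrary example values.
MAX_AGE_H=168
NOW=1700000000                 # "current" time (example)
RECORD_TS=1699300000           # record timestamp, ~194h earlier
AGE_H=$(( (NOW - RECORD_TS) / 3600 ))
if [ "$AGE_H" -gt "$MAX_AGE_H" ]; then
  echo "rejected (${AGE_H}h old)"
else
  echo "accepted"
fi
```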
+
+To accept messages within a specific time range, use the following configuration:
+
+```yaml
+  ingester:
+    max_chunk_age: 2h
+```
+
+Be careful: Loki computes the earliest timestamp that an out-of-order entry may have and still be accepted as:
+```
+time_of_most_recent_line - (max_chunk_age / 2)
+```
+
+Check the [accept out-of-order writes documentation](https://grafana.com/docs/loki/latest/configuration/#accept-out-of-order-writes) for more info.
+
+## Maximum active stream limit exceeded
+
+The number of active streams can be limited per user per ingester (unlimited by default) or per user across the cluster (default = 5000).
+
+To update these limits, you can tweak the following values:
+
+```yaml
+  limits_config:
+    max_streams_per_user: 0
+    max_global_streams_per_user: 5000
+```
+
+It's not recommended to disable both limits by setting them to `0`. When running multiple ingesters, you may set `max_streams_per_user` to `5000` and disable `max_global_streams_per_user`, or simply increase the `max_global_streams_per_user` value.
+
+Check the [limits_config](https://grafana.com/docs/loki/latest/configuration/#limits_config) documentation for more details.
+
+## Ingestion rate limit exceeded
+
+Ingestion is limited in terms of sample size per second via `ingestion_rate_mb`, with a per-distributor local burst size via `ingestion_burst_size_mb`:
+
+```yaml
+  limits_config:
+    ingestion_rate_mb: 4
+    ingestion_burst_size_mb: 6
+```
+
+It's common to set each of these above `10` (MB). You can safely increase these two values, but keep an eye on your ingester performance and your storage size.
+
+Check the [limits_config](https://grafana.com/docs/loki/latest/configuration/#limits_config) documentation for more details.
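To decide whether the limit needs raising, you can compare your estimated or measured ingest rate against the configured value; the numbers below are illustrative assumptions (in practice, measure the real rate from distributor metrics):

```shell
# Sketch: compare an estimated ingest rate against the configured limit.
# Both numbers are illustrative assumptions.
INGESTION_RATE_MB=4            # configured ingestion_rate_mb
ESTIMATED_MB_PER_SEC=6         # measured or estimated ingest rate
if [ "$ESTIMATED_MB_PER_SEC" -gt "$INGESTION_RATE_MB" ]; then
  echo "increase ingestion_rate_mb to at least ${ESTIMATED_MB_PER_SEC}"
else
  echo "within limits"
fi
```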