12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Change Log

## 2.0.5 - 2022-04-12
### Added
- Prometheus metrics support for multi worker configuration.
- 'FAQ' section to help customers triage issues.
### Changed
- Supporting both 'memory' and 'file' buffer types. The recommended and default buffer type is still 'file'.
- Using Yajl instead of the default JSON library to handle multi-byte character logs gracefully.
- Plugin log now defaults to STDOUT.
- The 'oci-logging-analytics.log' file is no longer created when 'plugin_log_location' is not explicitly provided in the match block section.
### Bug fix
- 'tag' field is no longer mandatory in the filter block.

## 2.0.4 - 2022-06-20
### Changed
- Updated prometheus-client dependency to v4.0.0.
227 changes: 227 additions & 0 deletions FAQ.md
@@ -0,0 +1,227 @@
# Frequently Asked Questions

- [Why am I getting this error - "Error occurred while initializing LogAnalytics Client" ?](#why-am-i-getting-this-error---error-occurred-while-initializing-loganalytics-client-)
- [Why am I getting this error - "Error occurred while parsing oci_la_log_set" ?](#why-am-i-getting-this-error---error-occurred-while-parsing-oci_la_log_set-)
- [Why am I getting this error - "Error while uploading the payload" or "execution expired" or "status : 0" ?](#why-am-i-getting-this-error---error-while-uploading-the-payload-or--execution-expired-or--status--0-)
- [How to find fluentd/output plugin logs ?](#how-to-find-fluentdoutput-plugin-logs-)
- [Fluentd successfully uploaded data, but still it is not visible in LogExplorer. How to triage ?](#fluentd-successfully-uploaded-data-but-still-it-is-not-visible-in-logexplorer-how-to-triage-)
- [How to extract specific K8s metadata field that I am interested in to a Logging Analytics field ?](#how-to-extract-specific-k8s-metadata-field-that-i-am-interested-in-to-a-logging-analytics-field-)
- [How to make Fluentd process the log data from the beginning of a file when using tail input plugin ?](#how-to-make-fluentd-process-the-log-data-from-the-beginning-of-a-file-when-using-tail-input-plugin-)
- [How to make Fluentd process the last line from the file when using tail input plugin ?](#how-to-make-fluentd-process-the-last-line-from-the-file-when-using-tail-input-plugin-)
- [In multi worker setup, prometheus is not displaying all the worker's metrics. How to fix it ?](#in-multi-worker-setup-prometheus-is-not-displaying-all-the-workers-metrics-how-to-fix-it-)
- [Why am I getting this error - "ConcatFilter::TimeoutError" ?](#why-am-i-getting-this-error---concatfiltertimeouterror-)
- [Fluentd is failing to parse the log data. What can be the reason ?](#fluentd-is-failing-to-parse-the-log-data-what-can-be-the-reason-)


## Why am I getting this error - "Error occurred while initializing LogAnalytics Client" ?
- This mostly occurs when an incorrect authorization type or configuration is provided.
- This plugin uses either config-file-based auth or instance-principal-based auth, with instance-principal-based auth being the default.
- For config-file-based auth, ensure that valid "config_file_location" and "profile_name" details are provided in the match block as shown below.
```
<match oci.**>
  @type oci-logging-analytics
  config_file_location #REPLACE_ME
  profile_name DEFAULT
</match>
```

## Why am I getting this error - "Error occurred while parsing oci_la_log_set" ?
- The provided regex does not match the incoming key, and the regex might need a correction.
- This might also be expected behaviour when the configured regex intentionally does not match every key. In such scenarios the plugin falls back to the logSet value set using oci_la_log_set (see the sketch below).
- You may also apply logSet using the alternative approach documented [here](https://docs.oracle.com/en-us/iaas/logging-analytics/doc/manage-log-partitioning.html#LOGAN-GUID-2EC8EEDE-9BBD-4872-8083-A44F77611524)
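
For illustration, the fallback logSet can be set as a record field through a standard record_transformer filter. This is a minimal sketch; the tag pattern and the log-set value are assumptions and must be adapted to your pipeline.

```
# Hypothetical example: set a static logSet on all records matching the oci.** tag
<filter oci.**>
  @type record_transformer
  <record>
    # Fallback value used when regex-based extraction does not apply
    oci_la_log_set <YOUR_LOG_SET>
  </record>
</filter>
```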

## Why am I getting this error - "Error while uploading the payload" or "execution expired" or "status : 0" ?
- Sample logs:
```
I, [2023-01-18T10:39:49.483789 #11] INFO -- : Received new chunk, started processing ...
I, [2023-01-18T10:39:49.495771 #11] INFO -- : Generating payload with 31 records for oci_la_log_group_id: ocid1.loganalyticsloggroup.oc1.iad.amaaaaaa....
E, [2023-01-18T10:39:59.502747 #11] ERROR -- : oci upload exception : Error while uploading the payload. { 'message': 'execution expired', 'status': 0, 'opc-request-id':'C37D1DE643E24D778FC5FA22835FE024', 'response-body': '' }
```

- This occurs due to connectivity issues with the OCI endpoint. Ensure that the proxy details, if configured, are valid, and that you have network connectivity to the OCI endpoint from the host where Fluentd is running.

## How to find fluentd/output plugin logs ?
- By default (starting from version 2.0.5), the OCI Logging Analytics output plugin logs go to STDOUT and are available as part of the Fluentd logs themselves, unless a location is explicitly configured using the following plugin parameter.
```
plugin_log_location "#{ENV['FLUENT_OCI_LOG_LOCATION'] || '/var/log'}"
# Log file named 'oci-logging-analytics.log' will be generated in the above location
```
- For td-agent (rpm/deb) based setups, the Fluentd logs are located at /var/log/td-agent/td-agent.log


## Fluentd successfully uploaded data, but still it is not visible in LogExplorer. How to triage ?
- Check whether the selected time range in Log Explorer is in line with the actual timestamps of the log messages.
- Check again after some time - since the data is processed asynchronously, it may take a while for the data to be reflected in Log Explorer.
- The processing of the data may fail in the subsequent validations that happen at OCI.
- Check for any [processing errors](https://docs.oracle.com/en-us/iaas/logging-analytics/doc/troubleshoot-ingestion-pipeline.html).
- If the issue still persists, raise an SR providing the following information: tenancy_ocid, region, and a sample opc-request-id/opc-object-id.
- You may get the opc-request-id/opc-object-id from the Fluentd output plugin log. A sample log for a successful upload looks like the following:

```
I, [2023-01-18T10:39:49.483789 #11] INFO -- : Received new chunk, started processing ...
I, [2023-01-18T10:39:49.495771 #11] INFO -- : Generating payload with 30 records for oci_la_log_group_id: ocid1.loganalyticsloggroup.oc1.iad.amaaaaaa....
I, [2023-01-18T10:39:59.502747 #11] INFO -- : The payload has been successfully uploaded to logAnalytics -
oci_la_log_group_id: ocid1.loganalyticsloggroup.oc1.iad.amaaaaaa....,
ConsumedRecords: #30,
opc-request-id: 'C37D1DE643E24D778FC5FA22835FE024',
opc-object-id: 'C37D1DE643E24D778FC5FA22835FE024-D37D1DE643E24D778FC5FA22835FE024'
```
## How to extract specific K8s metadata field that I am interested in to a Logging Analytics field ?
- This scenario typically comes up when collecting logs from Kubernetes clusters and using kubernetes_metadata_filter to enrich the data in Fluentd.
- By default, the Fluentd output plugin fetches the fields "container_name", "namespace_name", "pod_name", "container_image" and "host" from the Kubernetes metadata when available, and maps them to the fields "Container", "Namespace", "Pod", "Container Image Name" and "Node" respectively.
- If a new field needs to be extracted, or the default mappings need to be modified, add "kubernetes_metadata_keys_mapping" to the match block as shown below.
- When adding a new field mapping, ensure the corresponding Logging Analytics field is already defined.

```
<match oci.**>
  @type oci-logging-analytics
  namespace <YOUR_OCI_TENANCY_NAMESPACE> #REPLACE_ME
  config_file_location ~/.oci/config #REPLACE_ME
  profile_name DEFAULT
  kubernetes_metadata_keys_mapping {"container_name":"Container","namespace_name":"Namespace","pod_name":"Pod","container_image":"Container Image Name","host":"Node"}
  <buffer>
    @type file
    path /var/log/fluent_oci_outplugin/buffer/ #REPLACE_ME
    disable_chunk_backup true
  </buffer>
</match>
```

## How to make Fluentd process the log data from the beginning of a file when using tail input plugin ?
- The default behaviour of the tail plugin is to read from the end (tail) of the file. This behaviour can be changed with the "read_from_head" parameter.
- Below is an example tail plugin configuration that reads from the beginning of a file named foo.log

```
<source>
  @type tail
  <parse>
    @type none
  </parse>
  path foo.log
  pos_file foo.pos
  tag oci.foo
  read_from_head true
</source>
```

## How to make Fluentd process the last line from the file when using tail input plugin ?
- For multi-line events, consumption of the last event might be delayed until the next log message is written to the log file, because Fluentd only parses the last line once a line break is appended to it.
- To fix this, add or increase the multiline_flush_interval property in the source block.

```
<source>
  @type tail
  multiline_flush_interval 5s
  <parse>
    @type multiline
    format_firstline /\d{4}-[01]\d-[0-3]\d\s[0-2]\d((:[0-5]\d)?){2}\s+(\w+)\s+\[([\w-]+)?\]\s([\w._$]+)\s+([-\w]+)\s+(.*)/
    format1 /^(?<message>.*)/
  </parse>
  path foo.log
  pos_file foo.pos
  tag oci.foo
  read_from_head true
</source>
```

## In multi worker setup, prometheus is not displaying all the worker's metrics. How to fix it ?
- In a multi worker setup, each worker needs its own port binding. To ensure that the Prometheus engine scrapes metrics from all the ports, provide "aggregated_metrics_path /aggregated_metrics" in the prometheus source config as shown below (a sketch of the system directive that enables multiple workers follows the examples).
- Multi worker config example

```
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  aggregated_metrics_path /aggregated_metrics
</source>
```

- Single worker config example

```
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>
```
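
For reference, the multi worker setup itself is enabled through Fluentd's system directive. The sketch below assumes two workers; adjust the count to your deployment.

```
<system>
  # Number of worker processes; each worker binds its own Prometheus port
  workers 2
</system>
```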

## Why am I getting this error - "ConcatFilter::TimeoutError" ?
- This error occurs when using Concat plugin to handle multiline log messages.
- When the incoming log flow is very slow, the concat plugin throws this error to avoid waiting indefinitely for the next multiline start-expression match.
- This issue can be avoided by increasing the flush_interval of the concat filter to an appropriate value.
- We recommend using "timeout_label" to redirect the corresponding log messages and handle them appropriately to avoid data loss.
- When using "timeout_label", you may ignore this error.

```
# Concat filter to handle multi-line log records.
<filter oci.apache.concat>
  @type concat
  key message
  flush_interval 15
  timeout_label @NORMAL
  multiline_start_regexp /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.\d{3}Z/
  separator ""
</filter>

<label @NORMAL>
  <match **>
    @type oci-logging-analytics
    namespace <YOUR_OCI_TENANCY_NAMESPACE>
    config_file_location ~/.oci/config
    profile_name DEFAULT
    <buffer>
      @type file
      path /var/log
      retry_forever true
      disable_chunk_backup true
    </buffer>
  </match>
</label>
```


## Fluentd is failing to parse the log data. What can be the reason ?
- In the source block, check whether the regexp/format_firstline expression matches your log data.
- Sample logs:
```
2021-02-06 01:44:03 +0000 [warn]: #0 dump an error event:
error_class=Fluent::Plugin::Parser::ParserError error="pattern not match with data <log message>
```
- Multiline parser example:

```
<source>
  @type tail
  multiline_flush_interval 5s
  <parse>
    @type multiline
    format_firstline /\d{4}-[01]\d-[0-3]\d\s[0-2]\d((:[0-5]\d)?){2}\s+(\w+)\s+\[([\w-]+)?\]\s([\w._$]+)\s+([-\w]+)\s+(.*)/
    format1 /^(?<message>.*)/
  </parse>
  path foo.log
  pos_file foo.pos
  tag oci.foo
  read_from_head true
</source>
```

- Regexp parser example:

```
<source>
  @type tail
  multiline_flush_interval 5s
  # regexp parser
  <parse>
    @type regexp
    expression /^(?<name>[^ ]*) (?<user>[^ ]*) (?<age>\d*)$/
  </parse>
  path foo.log
  pos_file foo.pos
  tag oci.foo
  read_from_head true
</source>
```
43 changes: 36 additions & 7 deletions README.md
@@ -21,6 +21,7 @@ Refer [Prerequisites](https://docs.oracle.com/en/learn/oci_logging_analytics_flu
<th>rubyzip</th>
<th>oci</th>
<th>prometheus-client</th>
<th>yajl-ruby</th>
</tr>
<tr>
<td>>= 2.0.0</td>
@@ -29,6 +30,7 @@ Refer [Prerequisites](https://docs.oracle.com/en/learn/oci_logging_analytics_flu
<td>~> 2.3.2 </td>
<td>~> 2.16</td>
<td>~> 2.1.0</td>
<td>~> 1.4.3</td>
</tr>
</table>

@@ -59,6 +61,21 @@ Or install it manually as:
### Buffer Configuration

- [Buffer configuration parameters](https://docs.oracle.com/en/learn/oci_logging_analytics_fluentd/#buffer-configuration-parameters)
- Note* - Buffer type 'file' is recommended but no longer mandatory. Customers can configure any of the other available options; an example 'file' buffer configuration follows the list below.

#### Advantages of the 'file' based buffer plugin
- When the input plugin is fast and the output plugin is slow, the buffer keeps growing; a file based buffer with its default 50GB limit (configurable to a higher value) can absorb the output plugin delay.
A memory based buffer is unlikely to have that much memory configured, which can result in data loss.

- When the back-end service is unavailable (5XX exceptions), the output plugin keeps retrying the existing chunk while new chunks keep getting scheduled.
With each chunk being 2MB and a chunk interval of 30 sec, a 30 minute outage (which can be longer in unforeseen cases) requires a minimum of 120MB allocated for a memory buffer.

- In container based log collection, the logs are not saved in the containers and are completely lost if not consumed, so a file based buffer provides a persistent buffer implementation.

- For container based deployments, all of these edge cases and proper sizing need to be considered when creating the Fluentd config file in order to arrive at the memory size.

- Because data loss is critical and not all customers are aware of these cases, the memory buffer was previously not allowed. That limitation has now been removed, so informed customers can do the proper sizing and decide which option works best for them.
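
The following buffer block is a sketch of a typical 'file' buffer sizing based on the numbers discussed above; the path and the exact limits are assumptions and should be tuned for your environment.

```
<buffer>
  @type file
  path /var/log/fluent_oci_outplugin/buffer/ #REPLACE_ME
  chunk_limit_size 2m        # ~2MB chunks as discussed above
  total_limit_size 50g       # upper bound for buffered data on disk; increase if needed
  flush_interval 30s         # ~30 sec chunk interval as discussed above
  retry_forever true
  disable_chunk_backup true
</buffer>
```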


### Input Plugin Configuration

@@ -89,35 +106,43 @@ Refer [Viewing the Logs in Logging Analytics](https://docs.oracle.com/en/learn/o

## Metrics

The plugin emits following metrics in Prometheus format, which provides stats/insights about the data being collected and processed by the plugin. Refer [monitoring-prometheus](https://docs.fluentd.org/monitoring-fluentd/monitoring-prometheus) for details on how to expose these and other various Fluentd metrics to Prometheus (*If the requirement is to collect and monitor core Fluentd and this plugin metrics alone using Prometheus then Step1 and Step2 from the referred document can be skipped*).
The plugin emits the following metrics in Prometheus format, which provide stats/insights about the data being collected and processed by the plugin.
Refer to [monitoring-prometheus](https://docs.fluentd.org/monitoring-fluentd/monitoring-prometheus) for details on how to expose these and other Fluentd metrics to Prometheus (*if the requirement is to collect and monitor only core Fluentd and this plugin's metrics using Prometheus, then Step 1 and Step 2 from the referenced document can be skipped*).
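
As a reference, the metrics are typically exposed by adding a prometheus source block; this is a sketch assuming the fluent-plugin-prometheus gem is installed and the default port is used (see the FAQ for the multi worker variant).

```
<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>
```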

#### Note
For Prometheus metrics to work properly, add 'tag' and, in case of a multi worker configuration, 'worker_id' to the filter block:

tag ${tag}
worker_id ${ENV['SERVERENGINE_WORKER_ID']}
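
A fuller sketch of where these lines typically go, assuming a record_transformer filter is used to set the oci_la_* record fields (the log group and log source values are placeholders):

```
<filter oci.**>
  @type record_transformer
  enable_ruby true
  <record>
    oci_la_log_group_id <YOUR_LOG_GROUP_OCID>
    oci_la_log_source_name <YOUR_LOG_SOURCE_NAME>
    tag ${tag}
    worker_id ${ENV['SERVERENGINE_WORKER_ID']}
  </record>
</filter>
```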

#### Metrics details
Metric Name: oci_la_fluentd_output_plugin_records_received
labels: [:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set]
labels: [:worker_id,:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set]
Description: Number of records received by the OCI Logging Analytics Fluentd output plugin.
Type : Gauge

Metric Name: oci_la_fluentd_output_plugin_records_valid
labels: [:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set]
labels: [:worker_id,:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set]
Description: Number of valid records received by the OCI Logging Analytics Fluentd output plugin.
Type : Gauge

Metric Name: oci_la_fluentd_output_plugin_records_invalid
labels: [:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set,:reason]
labels: [:worker_id,:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set,:reason]
Description: Number of invalid records received by the OCI Logging Analytics Fluentd output plugin.
Type : Gauge

Metric Name: oci_la_fluentd_output_plugin_records_post_error
labels: [:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set,:error_code, :reason]
labels: [:worker_id,:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set,:error_code, :reason]
Description: Number of records failed posting to OCI Logging Analytics by the Fluentd output plugin.
Type : Gauge

Metric Name: oci_la_fluentd_output_plugin_records_post_success
labels: [:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set]
labels: [:worker_id,:tag,:oci_la_log_group_id,:oci_la_log_source_name,:oci_la_log_set]
Description: Number of records posted by the OCI Logging Analytics Fluentd output plugin.
Type : Gauge

Metric Name: oci_la_fluentd_output_plugin_chunk_time_to_receive
labels: [:tag]
labels: [:worker_id,:tag]
Description: Average time taken by Fluentd to deliver the collected records from Input plugin to OCI Logging Analytics output plugin.
Type : Histogram

@@ -126,6 +151,10 @@ The plugin emits following metrics in Prometheus format, which provides stats/in
Description: Average time taken for posting the received records to OCI Logging Analytics by the Fluentd output plugin.
Type : Histogram

## FAQ

See [FAQ](FAQ.md).


## Changes
