# Apache Druid logging

Configuring and maintaining logs is an essential task for operators of an Apache Druid cluster. Druid uses the [Apache Log4j](https://logging.apache.org/log4j/2.x/) logging framework to emit logs that are useful for day-to-day monitoring and for troubleshooting. Logs not only enable you to investigate issues and solve problems, but also help you understand how each Druid process works in isolation and in collaboration with the others.

In this notebook, you will take a tour of the out-of-the-box Log4j configuration in Apache Druid and use terminal commands to locate and examine the configuration and the logs it produces.

## Prerequisites

This tutorial works with Druid 30.0.0 or later. It is designed to run from a Mac with a locally running instance of Druid.

If you wish to use this tutorial within Jupyter through the [learn-druid](https://github.com/implydata/learn-druid) Docker Compose, use the `jupyter` profile to avoid starting a second instance of Druid that may cause conflicts.

## Initialization

In this step, you will find instructions to install prerequisite tools and to deploy Druid locally.

Before starting, open a terminal window.

### Install required tools

You will need the following tools:

* `brew` to install prerequisite tools.
* `wget` to pull Apache Druid from the official repository.
* `multitail` to view multiple log files simultaneously.

For instructions on installing `brew`, see the [Homebrew homepage](https://brew.sh/).

Install `wget` and `multitail` using `brew`. For example:

```bash
brew install multitail
brew install wget
```

You may need to manually fetch a default configuration for `multitail`.

Skip this step if you are already running `multitail` as it will overwrite your own configuration.

Execute the following command to pull the default configuration to your home folder.

```bash
curl https://raw.githubusercontent.com/halturin/multitail/master/multitail.conf > ~/.multitailrc
```

### Install Apache Druid

Run the following to create a dedicated folder for learn-druid in your home directory:

```bash
cd ~ ; mkdir learn-druid-local
cd learn-druid-local
```

Pull and unpack a compatible version of Apache Druid.

```bash
wget https://dlcdn.apache.org/druid/30.0.0/apache-druid-30.0.0-bin.tar.gz
tar -xzf apache-druid-30.0.0-bin.tar.gz
```

Use the following commands to rename the folder.

```bash
mv apache-druid-30.0.0 apache-druid
cd apache-druid
```

# Review the log file configuration

The log file configuration is set in the `log4j2.xml` file, which sits alongside the Druid configuration files.

Run this command to view the `auto` configuration file for logs used by the `learn-druid` script:

```bash
more ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

The [`Configuration`](https://logging.apache.org/log4j/log4j-2.4/manual/configuration.html#ConfigurationSyntax) element contains the following elements:

* [`Properties`](https://logging.apache.org/log4j/2.x/manual/configuration.html#PropertySubstitution) provide key/value pairs that may be used throughout the configuration file.
* [`Appenders`](https://logging.apache.org/log4j/2.x/manual/appenders.html) designate the format of log messages and determine the target for the messages.
* [`Loggers`](https://logging.apache.org/log4j/2.x/manual/configuration.html#Loggers) filter the log messages and dispense them to Appenders. Loggers can filter messages based on the Java package and/or class and by message priority.

## Properties

Druid uses the `Properties` element to set a location for all logs. The location for log files is available at start-up.

By default, this location is a "log" folder at the root of your Druid installation. You can override it as described in [log directory](https://druid.apache.org/docs/latest/configuration/logging/#log-directory).

## Appenders

There are two `appenders`:

* "Console": [`Console`](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#ConsoleAppender) appender for `SYSTEM_OUT`
* "FileAppender": [`RollingRandomAccessFile`](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender) appender for detailed process logs

### Start a Druid instance

The default Configuration for Druid does not include a [`monitorInterval`](https://logging.apache.org/log4j/log4j-2.4/manual/configuration.html#AutomaticReconfiguration) property, so changes to logging configuration are only recognised when a process restarts.

Run the following command to add a `monitorInterval` property to the Configuration:

```bash
sed -i '' 's/<Configuration status="WARN">/<Configuration status="WARN" monitorInterval="5">/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```
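
To confirm that the edit took effect, you can read the attribute back from the file. The helper below is a minimal sketch: `show_monitor_interval` is a hypothetical name, not part of Druid or Log4j, and it assumes the attribute sits on the same line as the opening `<Configuration` tag.

```bash
# Print the monitorInterval value set on the <Configuration> element,
# or "not set" if the attribute is absent.
# show_monitor_interval is a hypothetical helper for this tutorial.
show_monitor_interval() {
  local value
  value=$(sed -n -E 's/.*<Configuration[^>]* monitorInterval="([^"]*)".*/\1/p' "$1" | head -n 1)
  echo "${value:-not set}"
}

# Usage (path as used in this tutorial):
# show_monitor_interval ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```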

Start Druid with the following command:

```bash
nohup ~/learn-druid-local/apache-druid/bin/start-druid > log.out 2> log.err < /dev/null & disown
```

### Look at the standard log files

Since Druid is a distributed system, it writes a log file for each Druid process. Druid also captures whatever each process writes to standard output. Use the following command to list the contents of the log folder:

```bash
cd ~/learn-druid-local/apache-druid/log ; ls
```

The folder contains two sets of files:

* `<process name>.stdout.log`: file containing information written by the processes to the output stream
* `<process name>.log`: file containing status, error, warning, and debug messages

### Log filenames

The "FileAppender" `RollingRandomAccessFile` appender has both `fileName` and `filePattern` properties. The `fileName` property sets the name of the log file being written to at the moment, while `filePattern` is applied when the log rolls over.

Run the following command to change the default filename of the "FileAppender" in the Log4j configuration file so that all future detailed process logs will have a name suffixed with the hostname.

```bash
sed -i '' 's/{sys:druid.node.type}.log/{sys:druid.node.type}-${hostName}.log/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

List the contents of the log folder again to see the changes:

```bash
ls
```

Revert the change with the following command:

```bash
sed -i '' 's/{sys:druid.node.type}-${hostName}.log/{sys:druid.node.type}.log/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Since "FileAppender" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender), you can adjust the `TimeBasedTriggeringPolicy` to change when `fileName` log files are rolled over to `filePattern` log files.

Run the following in your terminal to adjust the `filePattern` to include the hours and minutes:

```bash
sed -i '' 's/{yyyyMMdd}/{yyyyMMdd-HH:mm}/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Run this command to see the new file pattern:

```bash
grep filePattern ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Since the `interval` of the `TimeBasedTriggeringPolicy` is set to 1 by default, a rollover is triggered each time the most specific time unit in the `filePattern` changes. With the pattern above, that unit is the minute.
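
As a rough illustration of why this pattern produces one file per minute, the sketch below prints the name a rolled file would carry right now. `rolled_name` is a hypothetical helper, and it assumes the rolled file is named `<process>.<date>.log`; check the `grep filePattern` output above for the exact form in your installation.

```bash
# Print the name a rolled-over log would receive at this moment,
# assuming a filePattern of <process>.%d{yyyyMMdd-HH:mm}.log.
# rolled_name is a hypothetical helper for this tutorial.
rolled_name() {
  # date's %Y%m%d-%H:%M mirrors the Java date pattern yyyyMMdd-HH:mm
  echo "$1.$(date +%Y%m%d-%H:%M).log"
}

# Run this twice, a minute apart, and the name changes.
rolled_name broker
```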

Run the following command a few times to see files being created every minute:

```bash
ls -l
```

Use other Log4j policies in the usual way. For example, you can adjust the `filePattern` to `%d{yyyyMMdd-HH:mm}-%i.log` and then use [`SizeBasedTriggeringPolicy`](https://logging.apache.org/log4j/2.x/manual/appenders.html#sizebased-triggering-policy) instead of a `TimeBasedTriggeringPolicy` to have log files emitted when a log file hits a particular size, rather than based on the timestamp.

### Log patterns

The content of log files is specified in the [`PatternLayout`](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout).

Run the following command to see the current Log4j configuration of the `PatternLayout` for the "Console" appender:

```bash
xmllint -xpath Configuration/Appenders/Console/PatternLayout ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Each element refers to a specific piece of recordable information.

* Timestamp (`%d{ISO8601}`)
* Log level (`%p`)
* Thread name (`[%t]`)
* Logger name (`%c`)
* Message (`%m%n`)
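
To make the mapping between the pattern and a log line concrete, the following sketch splits a single entry into the fields listed above. `parse_log_line` is a hypothetical helper, and it assumes the fields are separated by single spaces with ` - ` between the logger name and the message, as in the sample entry; adjust the expression if your `PatternLayout` differs.

```bash
# Split one log entry (read from standard input) into labelled fields.
# Assumes the layout: timestamp level [thread] logger - message
# parse_log_line is a hypothetical helper for this tutorial.
parse_log_line() {
  sed -n -E 's/^([^ ]+) ([A-Z]+) \[(.*)\] ([^ ]+) - (.*)$/timestamp=\1\
level=\2\
thread=\3\
logger=\4\
message=\5/p'
}

# Illustrative entry; real entries come from the process logs.
sample='2024-02-22T11:54:31,662 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Closing socket connection.'
echo "$sample" | parse_log_line
```

Lines that do not follow the layout, such as stack traces, produce no output.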

Run this command to follow along as log entries are added to the process logs for the Coordinator, Overlord, and Broker:

```bash
multitail -CS log4jnew -du -P a \
    -f coordinator-overlord.log \
    -f broker.log
```

Each log entry carries a level that indicates how severe the entry is. In the logs, you may see entries with these levels, listed from most to least severe:

* FATAL (system failure)
* ERROR (functional failure)
* WARN (non-fatal issue)
* INFO (notable event)
* DEBUG (program debugging messages)
* TRACE (highly granular execution event)
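
A quick way to get a feel for the mix of levels in a log is to count entries by level. This is a minimal sketch, assuming the level is the second space-separated field of each entry, as in the pattern above; `count_levels` is a hypothetical helper name.

```bash
# Count log entries per level, printed from most to least severe.
# Assumes the level is the second whitespace-separated field.
# count_levels is a hypothetical helper for this tutorial.
count_levels() {
  awk '
    $2 ~ /^(FATAL|ERROR|WARN|INFO|DEBUG|TRACE)$/ { count[$2]++ }
    END {
      n = split("FATAL ERROR WARN INFO DEBUG TRACE", order, " ")
      for (i = 1; i <= n; i++)
        if (order[i] in count) print order[i], count[order[i]]
    }' "$@"
}

# Usage, from the log folder:
# count_levels broker.log
```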

The thread name and logger name are helpful for diagnosis, especially for WARN and ERROR conditions.

Messages provide insights into events, process states, and significant variables within the system.

In your `multitail` window, press `q` to quit and return to the terminal.

### Log retention

Since "FileAppender" is a [RollingRandomAccessFileAppender](https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html#RollingRandomAccessFileAppender), you can adjust `DefaultRolloverStrategy` to control retention of logs by adjusting the `Delete` section.

`IfFileName` and `IfLastModified` are used in conjunction to remove any files from the log folder that match the rules. The default is to remove matching files older than seven days (`age="7d"`), based on the date last modified.
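
To preview which files such a rule would match before changing it, you can approximate the two conditions with `find`. This is only a sketch: `preview_delete` is a hypothetical helper, it assumes the rule uses a `*.log` glob and looks only at the top level of the folder, and `find` rounds file ages slightly differently from Log4j.

```bash
# List files that a Delete rule would roughly match:
#   -name '*.log'  stands in for IfFileName (assumed glob)
#   -mmin +N       stands in for IfLastModified, with N in minutes
# preview_delete is a hypothetical helper for this tutorial.
preview_delete() {
  find "$1" -maxdepth 1 -type f -name '*.log' -mmin +"$2"
}

# Usage: log files last modified more than 2 minutes ago
# preview_delete ~/learn-druid-local/apache-druid/log 2
```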

Run the following command to change the `IfLastModified` age to a [duration](https://logging.apache.org/log4j/2.x/javadoc/log4j-core/org/apache/logging/log4j/core/appender/rolling/action/Duration.html#parseCharSequence) of "PT2M". As a consequence, Log4j automatically deletes any file matching the `IfFileName` filter that is older than two minutes.

After running it, list the folder repeatedly and compare with your previous `ls -l` results to see how the `Delete` rule is applied.

```bash
sed -i '' 's/IfLastModified age="7d"/IfLastModified age="PT2M"/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Notice that every file matching the rule is removed, not just the rolled-over logs.

```bash
ls
```

Run the following command to change the retention policy to one day:

```bash
sed -i '' 's/IfLastModified age="PT2M"/IfLastModified age="P1D"/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Run the following command to return the file pattern to the default:

```bash
sed -i '' 's/{yyyyMMdd-HH:mm}/{yyyyMMdd}/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

## Loggers

This section of the configuration controls what types of events are logged and what data is recorded.

Before beginning, restart your Druid instance to recreate any logs that were deleted by the retention rules above.

Run the following command to kill your Druid instance:

```bash
# The [s] in the pattern stops grep from matching its own process entry
kill $(ps -ef | grep '[s]upervise' | awk 'NF{print $2}' | head -n 1)
```

Restart Druid with the following command:

```bash
nohup ~/learn-druid-local/apache-druid/bin/start-druid > log.out 2> log.err < /dev/null & disown
```

### Logging level

The out-of-the-box Druid configuration for Log4j sets a `Root` level of `INFO` for the `FileAppender`, meaning that only messages with a level of `INFO` or higher are recorded.

Run this command to see the configuration for your instance:

```bash
xmllint -xpath Configuration/Loggers/Root ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Other base levels are set at a package or class level to reduce log noise. For example:

```xml
    <!-- Quieter KafkaSupervisors -->
    <Logger name="org.apache.kafka.clients.consumer.internals" level="warn" additivity="false">
        <Appender-ref ref="FileAppender"/>
    </Logger>
```
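
To see every level override at a glance, you can pull the logger names and levels out of the configuration file. The sketch below uses a hypothetical helper, `list_logger_levels`, and assumes each `<Logger>` element lists `name` before `level` on a single line, as in the excerpt above.

```bash
# Print "<level> <logger name>" for the Root logger and each named Logger.
# Assumes name="..." precedes level="..." on the same line.
# list_logger_levels is a hypothetical helper for this tutorial.
list_logger_levels() {
  sed -n -E \
    -e 's/.*<Root level="([^"]+)".*/\1 (root)/p' \
    -e 's/.*<Logger name="([^"]+)" level="([^"]+)".*/\2 \1/p' "$1"
}

# Usage (path as used in this tutorial):
# list_logger_levels ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```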

Run this command to monitor several log files:

```bash
multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \
    -f coordinator-overlord.log \
    -f broker.log \
    -f middlemanager.log \
    -f historical.log
```

Open a new terminal window and run this command to amend the base logging level for all Druid processes:

```bash
sed -i '' 's/Root level="info"/Root level="debug"/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Within a short period of time you will see many more log messages appearing.

Revert the logging level to INFO before proceeding.

```bash
sed -i '' 's/Root level="debug"/Root level="info"/' \
  ~/learn-druid-local/apache-druid/conf/druid/auto/_common/log4j2.xml
```

Leave `multitail` running and your second terminal open.

## Examples

Two recommended approaches for working with log files are:

1. Search individual log files for indications of a problem, such as WARN, ERROR, and FATAL entries, and for Java exceptions in stack traces (for example, `java.net.ConnectException: Connection refused`).
2. Read log files more like a novel, starting at the beginning of the file or some known intermediate point and following the story logically.

In the first approach, it's possible to work back through the history, using the time field as a key. You may notice large time gaps, or use filtering to remove any events from threads or classes that you know are not relevant to the error.
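
The first approach can be sketched with `grep`. The helper name `find_problems` is hypothetical, and the expression assumes the level appears as a space-delimited word, as in the log pattern used in this tutorial.

```bash
# Print, with line numbers, entries at WARN level or above and any
# line that mentions a Java exception.
# find_problems is a hypothetical helper for this tutorial.
find_problems() {
  grep -nE ' (WARN|ERROR|FATAL) |Exception' "$@"
}

# Usage, from the log folder. The second command filters out a logger
# that you have decided is not relevant:
# find_problems coordinator-overlord.log
# find_problems broker.log | grep -v 'org.apache.zookeeper'
```

The line numbers give you a starting point for working back through the surrounding entries.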

The second approach requires more time, but helps to solve more complex problems. It's also an important learning aid when delving into how Druid processes collaborate to realize services.

This section contains examples of using log files to diagnose issues with Druid.

### Failure in Apache ZooKeeper

Druid processes rely on [Apache ZooKeeper](https://zookeeper.apache.org/) for [inter-process communication and configuration](https://druid.apache.org/docs/latest/dependencies/zookeeper.html). Run the following commands to simulate a failure in ZooKeeper.

* The first `kill` prevents ZooKeeper from being recovered by the `learn-druid` script's `supervisor`.
* The second `kill` stops the ZooKeeper process.

```bash
# The bracketed first letter stops grep from matching its own process entry
kill -STOP $(ps -ef | grep '[p]erl' | awk 'NF{print $2}' | head -n 1)
kill $(ps -ef | grep '[z]oo' | awk 'NF{print $2}' | head -n 1)
```

In your `multitail` window, you will now see multiple messages relating to ZooKeeper connection issues, such as this `WARN` entry:

```
2024-02-22T11:54:31,662 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x10001277b290002 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.

java.net.ConnectException: Connection refused
```

Run the following command to restart the "supervisor", which will restart ZooKeeper automatically.

```bash
# The [p] in the pattern stops grep from matching its own process entry
kill -CONT $(ps -ef | grep '[p]erl' | awk 'NF{print $2}' | head -n 1)
```

You can use the `-e` switch on `multitail` to filter the logs that are displayed, and `-ev` to invert the match. For example, close your `multitail` window with `q` and run the following command to show only WARN messages from the Coordinator and Overlord and to hide all INFO messages from the Broker, Middle Manager, and Historical. You may then want to repeat the ZooKeeper failure above to see the difference.

```bash
multitail -CS log4jnew -du -P a -s 2 -sn 1,3 \
    -e "WARN" -f coordinator-overlord.log \
    -ev "INFO" -f broker.log \
    -ev "INFO" -f middlemanager.log \
    -ev "INFO" -f historical.log
```

# Clean up

Run the following command to stop Druid:

```bash
# The [s] in the pattern stops grep from matching its own process entry
kill $(ps -ef | grep '[s]upervise' | awk 'NF{print $2}' | head -n 1)
```

Remove the `learn-druid-local` folder from your home folder in the usual way.

## Learn more


For more information:

* Watch [Druid optimizations for scaling customer facing analytics at Conviva](https://youtu.be/zkHXr-3GFJw?t=746) by Amir Youssefi and Pawas Ranjan from Conviva, which describes how useful this information can be for tuning Druid clusters.
* Read about Druid [logging](https://druid.apache.org/docs/latest/configuration/logging.html) in the official documentation.
* See more ways to use and run `multitail` on the [official site](https://www.vanheusden.com/multitail/index.html).
* Read about Log4j in [Apache Logging Services](https://logging.apache.org/) documentation.
* Review [simple date format](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) for `filePattern`.
* Review [pattern layout](https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout) for `PatternLayout`.

Notice that the Logger Name shows a fully-qualified class name. Here are some examples of searches you can conduct on your logs to deepen your knowledge of what your instance is doing.

| Log | Search Term |
|---|---|
| Any | __NodeRoleWatcher__<br>Across all the processes, watch as they detect changes in the processes that are running in the cluster, and see what they do about it. |
| Any | __org.apache.druid.initialization.Initialization__<br>These messages are all about the process starting up. It can be interesting to see what exactly each one does and if it runs into issues. |
| Coordinator / Overlord | __org.apache.druid.metadata.SQLMetadataRuleManager__<br>This is the coordinator polling the rules in the metadata database, getting ready to apply them. The log tells you how many rules it picks up and how many data sources they cover. |
| Coordinator / Overlord | __org.apache.druid.metadata.SqlSegmentsMetadataManager__<br>Messages show how many segments the cluster thinks are “used” – ready to be used for queries. |
| Coordinator / Overlord | __org.apache.druid.indexing.overlord.RemoteTaskRunner__<br>Gives interesting information about what’s happening with ingestion resources in the cluster, including when they first advertise themselves. |
| Coordinator / Overlord | __org.apache.druid.server.coordinator.rules__<br>Lots of information about how the retention rules are applied. |
| Coordinator / Overlord | __org.apache.druid.server.coordinator.duty.BalanceSegments__<br>Here you can see what Druid decides to do when balancing the workload, such as when a server is lost or added. |
| Historical | __org.apache.druid.server.coordination.BatchDataSegmentAnnouncer__<br>You can see individual segments being announced as available for query by each historical server as it loads them. |
| Historical | __org.apache.druid.server.coordination.SegmentLoadDropHandler__<br>As well as seeing how the Historical checks its local segment cache on startup, you can watch as the Historical picks up the instructions from the Coordinator and then does something about them. When there are ERRORs like “Failed to load segment for dataSource”, you get traces about what the process was trying to do – quite often something pointing to an error with its connection to deep storage. |
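
The searches in the table can be run with `grep`. As a minimal sketch, the hypothetical helper `count_term` below reports how many lines in each log file mention a search term, which helps you decide which file to read first.

```bash
# Print "<file>:<count>" for each file, counting lines that contain
# the search term as a fixed string.
# count_term is a hypothetical helper for this tutorial.
count_term() {
  local term="$1"
  shift
  grep -cHF -- "$term" "$@"
}

# Usage, from the log folder:
# count_term NodeRoleWatcher coordinator-overlord.log broker.log historical.log
```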