(This may be basic, but it is critical to understand)
- WHEN
- WHERE
- WHY
- Context
Understand why failures are occurring in a system
There are two parts: a 'log' and an 'entry' in the log
-
The log:
- a set of events or activities that have occurred over time
-
The log entry:
- Is when something occurred
- Data about the event
- Often includes a ‘severity level'
An Apache Web Server Log:
// host - user [utcdate] “METHOD /url/resource.file protocol” response filesize
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
It answers: What was accessed, when, who, and provides context
(This helps make a distinction between logging and metrics)
Gather detailed information about:
- WHEN an event occurred
- WHAT happened when the event occurred (context)
- WHERE it happened
Logging is not designed to provide these things:
- What the system as a whole is doing
- How much the system is doing
- High level view of system activity
These are the objectives of a metrics system.
Tactic: It is possible to use logging to accomplish these goals. If attempting to stabalize a sytem without logging or metrics, starting with logging can often help answer the questions metrics are designed to answer.
Best Practice: Keep logging separated from metrics. If mixed, and you want to raise the log level in production (only log warnings and errors) you will loose your metrics, which is unacceptable.
-
Fatal - When process has crashed (Full client, NodeJs)
-
Error - Significant error has occurred. Keep an eye on these.
-
Warning - Shouldn't occur, but not sure how to handle (Try/Catch)
-
Information - Comprises most logs, can help understand failures
-
Trace - 'Debugging' in Prod
Example of the contents of a log entry:
entry = {
severity: Fatal | Error | Warn | Log | Trace,
timestamp: UtcDate,
// system the log originates from
source: String,
correlation_id: UUID,
message: String,
payload: Any,
}
- During development we mainly use: Information, Warnings and Errors
- Once operational, maybe only log Warnings
- Debugging Prod, use Trace - understand why the errors are occurring
- In high volume system, only capture Errors.
There are different strategies that can be used to handle entries on the client side:
- Log to Disk, with utility to move logs to more permanent location
- Pros: Reduces network traffic, logs don’t get lost during ‘log service’ outage
- Cons: Increase disk IO and need another job to persist
- Log directly to a service
- Pros: Immediate log feedback (great for debugging)
- Cons: Increased network traffic
- Availability - What happens when log service is unavailable?
- Performance Impact - heavy logging impacts disk and/or network
- Be cautious of hand rolled HTTP solutions: if remote site is unavailable, the connection timeout can cause a PROD system to grind to a halt. (I’ve seen it, twice, on the same system.)
- Industry convention says: "Do not build your own logging library", there are many very good frameworks. They've found all the edge cases and failure paths (See StackOverflow)
- Logging system goes down - For example, pushing all logs to a database could overwhelm a database and so overall performance of the system (or other systems using that logging database could be degraded).
- Log4Net - derived from very popular Java library
- System.Diagnostics.Trace - .Net framework
- GreyLog - logs to disk, separate process sends to a log aggregator service
- Log Database - sends log directly to a database
- Splunk - send logs directly to a log service
- Application Insights - has some ‘logging’ capabilities - not recommended for logging, its the wrong tool.
More research needs to be done in this area to understand industry best practices.
It may be helpful to include more information about the source of the logs, such as:
- Server/Node name - this allows issues with specific nodes to be identified
- Code version
There are various types of logging.
- Application Logging
- Web Server (IIS)
- OS
Different systems may have different needs and slightly different solutions.