A highly performant network daemon that runs on the Node.js platform that translates, fans out and/or consolidates any incoming data stream to any outgoing data stream with failover support.
The incoming streams can consist of one or more of UDP, TCP, Unix Socket, File or STDIN
and the outoing streams can consist of one or more of UDP, TCP, Unix Socket or STDOUT
.
Here's a quick graphical overview:
-
remotes
PipeD
will send your data to one or more remote endpoints. It can do aone to one
,many to one
, andmany to many
mapping (which allows for both fan out and consolidation). It also monitors the health of the remotes, failing over to a backup remote as needed. -
listeners
PipeD
can listen for input in any way that you can do cross process communication in your favourite Unix environment. Any input point is called alistener
-
chains A list of failover remotes is considerd a chain. The idea is that at least one node in the chain should be available for delivery, or the chain is marked as sick. This is usually a state you don't want to be in, as you'd be losing (one of) your remote streams.
All configuration options to PipeD
are documented in etc/config.js.example
This file holds all options, their documentation, applications defaults and examples on how to use them. See the Examples section below for some graphical examples as well.
The easiest way to show how PipeD
works is through some examples:
The picture above shows a very simple fanout situation; the access.log
is tailed, and for every line, a copy of that line is sent to both the
stats server
, as well as the db server
Here's what that configuration would look like for PipeD
:
{ // Tail these files, and process each line
files: [
'/var/log/access.log',
],
// Send a copy to both servers
servers: [
[ 'tcp://stats-server.example.com:12345' ],
[ 'tcp://db-inserter.example.com:23456' ],
],
}
The picture above shows the most basic failover scenario; if the host syslog1
is unavailable, send the line to syslog2
instead. If syslog1
becomes available
again, switch traffic back there.
In addition, the picture shows that rather than tailing a log file like in the
example above, we can also directly listen on stdin
, and hook into Apache Customlog functionality.
Here's what that configuration would like for PipeD
:
{ // Listen on stdin
stdin: true,
// Send a copy to at least one of these servers
servers: [
'tcp://syslog1.example.com:12345',
'tcp://syslog2.example.com:12345',
],
}
For more examples for configuring listeners & remotes, please take a look at the etc/config.js.example file shipped in this repository.
This file also holds all the configuration options for PipeD
, their documentation
and defaults.
There are additional config variables available for debugging:
debug
- log common operations to stderrtrace
- log extensive information about all operations to stderr
For more information, check the etc/config.js.example file.
-
Install node.js
-
Clone the project
-
Run
npm install
from the repository root -
Create a config file
from etc/config.js.example
and put it somewhere -
Start the Daemon:
node bin/piped.js --config=/path/to/config
(optional: all config values can be passed on the command line as well)
PipeD
comes with a test harness to exercise it's core features. To run
the test suite, run the following command from the repository root:
$ ./run_tests.sh
PipeD
exposes an Admin interface (by default on port 29030) that you
can telnet to to inspect the health of the system.
Here is a list of supported commands:
-
help Returns a list of supported commands
-
ping Returns 'pong' - useful for basic monitoring/health checks
-
config Returns a
JSON
representation of the config this instance is using -
stats Returns a
JSON
representation of the send/receive statistics and the state of the listeners, chains and remotes.
PipeD
integrates with StatsD for usage
statistics. Simply configure the StatsD
endpoint in your config and watch
the stats on listeners, chains and remotes stream in.
At Krux we deal with a large amount of inbound analytics data (well north of 15k/second) on our beacon nodes. We store this data locally on disk for later batch processing, but we also want to move this data off the host asap for real time processing of certain bits of data.
Originally, we had used rsyslog as the mechanism to move files off the host, but found it had a number of short comings that greatly reduced it's usefulness to us:
-
High CPU overhead At several thousand requests per second, even a basic
forward everything
rule in rsyslog was eating up significant amounts of CPU, leaving less for the actual processes we wanted to run -
Flaky failover When the primary delivery node for rsyslog disappeared, it would often mark the node as sick, but not failover to the fallback host, causing it to overflow it's internal buffer, then disk, and eventually block outgoing traffic.
-
Input conversion As we started to use more piping techniques, not all inputs were in a format that rsyslog could easily consume, or would require us to run multiple rsyslog instances to watch all the inputs.
PipeD
aims to address all the issues above while being small, flexible,
fast and lightweight.
As a comparison, on a test using a c1.medium,
PipeD
was able to handle 85k messages/second over TCP
or UDP with a
single remote endpoint, where rsyslog would run out of CPU at 10k/second.
Interesting detail to the PipeD
benchmark is that the bottlenecks were the
cat
and nc
process that are taking up the majority of the CPU for the
benchmark.
You're interested in contributing to PipeD? AWESOME. Here are the basic steps:
fork PipeD from here: http://github.com/jib/piped
- Clone your fork
- Hack away
- If you are adding new functionality, document it in the README
- If necessary, rebase your commits into logical chunks, without errors
- Push the branch up to GitHub
- Send a pull request to the jib/piped project.
We'll do our best to get your changes in!