# Logbox
Logbox is an efficient Hadoop-based tool for tidying up huge amounts of unstructured logs. With Logbox, your log lines can be reshaped and stored in several output locations within a single execution.
Imagine you have a huge unstructured log file at s3://example-logs/example-2013-10-22_00000.lzo, a chunk of which looks like this:
2013-10-22 12:12:12 app321 registration userid=12345 source=direct
2013-10-22 12:12:13 app456 payment_event:321 userid=34567 ENJOY $59.00
2013-10-22 12:12:14 app567 some dummy event we don't care about right now
2013-10-22 12:12:14 app567 so simply leave it here
2013-10-22 12:12:13 app456 payment_event:443 userid=12345 PRO $159.00
2013-10-22 12:12:13 app456 registration userid=23456 source=adwords
You want to clean this up and have two separate log files:
- one for the registrations, which will be stored in s3://example-logs/logbox/registration
- another one for payment records, which will be stored in s3://example-logs/logbox/payment.
Also, you want log lines to be put into a directory named after the date the line is associated with.
Finally, you want to get rid of the clutter and transform these lines into better shape.
Let's look at this configuration file (there's more explanation below):
{
"categories": [
{
"input_glob": "example-${date_glob}_*",
"name": "example_category",
"rules": [
{
"name": "registration",
"description": "Separate registrations by source",
"match": "^(?<date>20[0-9\\-]{8}) (?<time>[0-9:]{8}) (?<appserver>[a-z\\-0-9]+) registration userid=(?<userid>[0-9]+) source=(?<source>\\w+)$",
"output_format": "${date} ${time} ${userid} ${source}",
"output_location": "registration/${date}/${input_basename}"
},
{
"name": "payment",
"description": "Separate payment events",
"match": "^(?<date>20[0-9\\-]{8}) (?<time>[0-9:]{8}) (?<appserver>[a-z\\-0-9]+) payment_event:[0-9]+ userid=(?<userid>[0-9]+) (?<license>\\w+) \\$(?<amount>[0-9\\.]+)$",
"output_format": "${date} ${time} ${userid} ${license} ${amount}",
"output_location": "payment/${date}/${input_basename}"
}
]
}
],
"max_line_length" : "100000",
"reducer_number": "40",
"input_filename" : "^(?<name>.*)\\-(?<date>20[0-9\\-]{8})_(?<hour>000[0-2][0-9])$",
"output_name_template" : "${name}-${date}",
"input_compression": "lzo",
"input_location_prefix": "s3://example-logs/",
"output_compression": "lzo",
"output_location": "s3://example-logs/logbox/"
}
After executing Logbox, you will have two new directories with the following content:
s3://example-logs/logbox/registration/2013-10-22/example-2013-10-22:
2013-10-22 12:12:12 12345 direct
2013-10-22 12:12:13 23456 adwords
s3://example-logs/logbox/payment/2013-10-22/example-2013-10-22:
2013-10-22 12:12:13 34567 ENJOY 59.00
2013-10-22 12:12:13 12345 PRO 159.00
It's really that simple.
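To make the rule mechanics concrete, here is a minimal Python sketch of what a single rule does to a single line. It is not Logbox's implementation (Logbox itself runs as a Java jar on Hadoop). The helpers `to_python_regex` and `apply_rule` are hypothetical names introduced for illustration, and the sketch assumes that `${input_basename}` is supplied by the caller. The `(?<name>...)` named groups from the config are rewritten to `(?P<name>...)` only because Python's `re` module requires that spelling.

```python
import re
from string import Template


def to_python_regex(config_regex):
    # The config writes named groups as (?<name>...); Python's re wants (?P<name>...).
    return re.sub(r"\(\?<([A-Za-z][A-Za-z0-9]*)>", r"(?P<\1>", config_regex)


def apply_rule(rule, line, input_basename):
    """Return (output_location, formatted_line), or None if the rule does not match."""
    m = re.match(to_python_regex(rule["match"]), line)
    if m is None:
        return None
    # Named groups plus the basename become the variables usable in the templates.
    fields = dict(m.groupdict(), input_basename=input_basename)
    return (
        Template(rule["output_location"]).substitute(fields),
        Template(rule["output_format"]).substitute(fields),
    )


# The "registration" rule from the config above, after JSON decoding.
registration = {
    "match": r"^(?<date>20[0-9\-]{8}) (?<time>[0-9:]{8}) (?<appserver>[a-z\-0-9]+) "
             r"registration userid=(?<userid>[0-9]+) source=(?<source>\w+)$",
    "output_format": "${date} ${time} ${userid} ${source}",
    "output_location": "registration/${date}/${input_basename}",
}

line = "2013-10-22 12:12:12 app321 registration userid=12345 source=direct"
print(apply_rule(registration, line, "example-2013-10-22"))
# ('registration/2013-10-22/example-2013-10-22', '2013-10-22 12:12:12 12345 direct')
```

Lines that match no rule, such as the two `app567` lines in the sample, yield `None` in this sketch, which corresponds to them not appearing in any output file.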
- Check out this project.
- Download the dependencies, then build the jar file that bundles all of them:
mvn clean compile assembly:single
You will find your jar file here:
target/logbox-1.0-SNAPSHOT-jar-with-dependencies.jar
- Local test execution
- Production execution
Parameter | Description |
---|---|
-rd,--run | Run Logbox on Hadoop |
-rt,--local-test | Test Logbox locally |
-cf,--config | Path to the config.json file |
-s,--start-date | The date to run Logbox from (YYYY-MM-DD). Only applies to Hadoop execution |
-e,--end-date | The date to run Logbox until (YYYY-MM-DD). Only applies to Hadoop execution |
-f,--rule-filters | Comma-separated list of the rules to be executed, in the form of . By default, each rule is executed |
-i,--input-location | Path of input file. Only applies to local test |
-tc,--test-category | Log category to test in local test. Only applies to local test |
Take a look at the examples.
In local test mode, Logbox runs locally (without Hadoop) and simulates the execution of the job. It prints the name of each output file and the lines written to it to stdout, so you can check whether your regexps are correct and whether you specified the input and output paths correctly.
Executing Logbox in local test mode:
java -jar target/logbox-1.0-SNAPSHOT-jar-with-dependencies.jar --local-test \
--input-location example/example-2013-10-22_00000 \
--test-category example_category \
--config example/config.json
Check out the results:
INFO: Added rule example_category -> registration
INFO: Added rule example_category -> payment
INFO: Using basename: example-2013-10-22
registration/2013-10-22/example-2013-10-22 <- 2013-10-22 12:12:12 12345 direct
payment/2013-10-22/example-2013-10-22 <- 2013-10-22 12:12:13 34567 ENJOY 59.00
payment/2013-10-22/example-2013-10-22 <- 2013-10-22 12:12:13 12345 PRO 159.00
registration/2013-10-22/example-2013-10-22 <- 2013-10-22 12:12:13 23456 adwords
Executing Logbox on Hadoop (production execution):
hadoop jar target/logbox-1.0-SNAPSHOT-jar-with-dependencies.jar com.prezi.logbox.main.App \
--run \
-s 2013-07-21 \
-e 2013-07-23 \
-cf config.json
The start and end dates specify the time range to process. (If start equals end, only the logs from that day are processed.)
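As an illustration of the date range, the following Python sketch lists the days covered by an inclusive start/end pair and fills each one into an `input_glob` such as the one from the example config. How Logbox actually expands `${date_glob}` is not described here, so the one-glob-per-day expansion is an assumption, and `days_in_range` and `globs_for_range` are hypothetical helper names.

```python
from datetime import date, timedelta
from string import Template


def days_in_range(start, end):
    """Yield every day from start to end, both inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def globs_for_range(input_glob, start, end):
    # Assumption: ${date_glob} is replaced by one YYYY-MM-DD value per day.
    return [
        Template(input_glob).substitute(date_glob=d.isoformat())
        for d in days_in_range(start, end)
    ]


print(globs_for_range("example-${date_glob}_*", date(2013, 7, 21), date(2013, 7, 23)))
# ['example-2013-07-21_*', 'example-2013-07-22_*', 'example-2013-07-23_*']
```

With equal start and end dates the range contains exactly one day.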
Create a config file in JSON format. This is the hardest part of working with Logbox :)
The config file should contain the following information:
- `input_location_prefix`: Where can the input logs be found?
- `output_location`: Where do you want the derived files to be stored?
- `input_compression` and `output_compression`: What is your input format and what is your desired output format? Logbox supports (splittable) lzo compression, so here you can write either "lzo" or "none". If you want lzo files as your output, Logbox will create index files as well.
- `input_filename` and `output_name_template`: The naming convention of your filtered log files. You specify with a regexp how the input file names are parsed, and with a template how the files produced by Logbox should be named. This can be very useful if you want to aggregate your results, e.g. instead of having files containing logs of a 1-hour timespan, you can store your sorted logs in daily resolution. Why might you need this? As the indexing of lzo files doesn't scale very well, it is not practical to have hundreds or thousands of small lzo files.
- `max_line_length`: Maximum line length; all lines longer than this number will be omitted.
- `reducer_number`: Number of reducers for the MapReduce job.
- Filtering/Transformation rules: You can define categories. Each category has its own input path, described by the `input_glob` property. In each category, rules can be defined; these filters and transformations will be applied to the files that belong to the category. For each filtering rule you must provide the following properties:
  - a `name` and a `description` of your rule. This will help your co-workers to understand your intentions.
  - `match`: a regular expression, possibly defining named groups, to identify fields of interest in the processed log line.
  - `output_format`: How should the processed lines look?
  - `output_location`: The output directory of this rule.
Take a look at the examples which show you how to set up a simple config.
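The interplay of `input_filename` and `output_name_template` can be sketched in a few lines of Python. This is an illustration under assumptions, not Logbox's own code: `output_basename` is a hypothetical helper, the file name is assumed to arrive without its compression extension, and the named groups are rewritten to Python's `(?P<name>...)` spelling. The pattern and template are the ones from the example config.

```python
import re
from string import Template

# Values from the example config, after JSON decoding.
INPUT_FILENAME = r"^(?<name>.*)\-(?<date>20[0-9\-]{8})_(?<hour>000[0-2][0-9])$"
OUTPUT_NAME_TEMPLATE = "${name}-${date}"


def output_basename(filename):
    """Derive the output base name from an input file name."""
    # Python's re needs (?P<name>...) instead of (?<name>...).
    pattern = re.sub(r"\(\?<([A-Za-z][A-Za-z0-9]*)>", r"(?P<\1>", INPUT_FILENAME)
    m = re.match(pattern, filename)
    if m is None:
        raise ValueError("file name does not match input_filename: " + filename)
    return Template(OUTPUT_NAME_TEMPLATE).substitute(m.groupdict())


print(output_basename("example-2013-10-22_00000"))  # example-2013-10-22
print(output_basename("example-2013-10-22_00023"))  # example-2013-10-22
```

Because the template leaves out `${hour}`, both hourly inputs map to the same base name, which is how hourly files can end up aggregated into one daily output.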
- When running Logbox you may specify an already existing output directory, but be careful: if the job writes to a file that already exists, Logbox will overwrite it without notification (unlike a standard Hadoop job).
- Logbox will eliminate duplicate log lines from the input files.
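The effect of duplicate elimination can be illustrated with a small Python sketch. It only shows the observable result on a list of lines; how Logbox removes duplicates inside the MapReduce job, and in which order the surviving lines are written, is not covered here. `drop_duplicates` is a hypothetical helper name.

```python
def drop_duplicates(lines):
    """Keep the first occurrence of each line and drop exact repeats."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


lines = [
    "2013-10-22 12:12:12 12345 direct",
    "2013-10-22 12:12:12 12345 direct",  # exact repeat, dropped
    "2013-10-22 12:12:13 23456 adwords",
]
print(drop_duplicates(lines))
# ['2013-10-22 12:12:12 12345 direct', '2013-10-22 12:12:13 23456 adwords']
```

One consequence worth keeping in mind: two genuinely separate events that produce byte-identical lines (for example, within the same second) would be indistinguishable from a duplicate.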
Have fun using Logbox!