jensvoid edited this page Dec 5, 2015 · 55 revisions

/// under construction ///

Requirements

LORG requires a recent PHP installation (5.3 or greater) within a UNIX-like environment (Linux, xBSD). On Debian and its derivatives (Ubuntu, Kali), you can simply install the CLI version like this:

$ sudo apt-get install php5-cli

Installation

The easiest way to install LORG is to clone the GitHub repository:

$ git clone "https://github.com/jensvoid/lorg"

Getting started

For the simplest usage, with logfile format auto-detection, try:

$ ./lorg /path/to/access_log

Since this will probably not meet your requirements, LORG is highly customizable. A robust default usage with PHPIDS-based attack detection, attack quantification based on 'bytes-sent' outliers, DNSBL and hostname lookups, geotargeting and URL-decoding might be something like:

$ ./lorg -d phpids -b all -q bytes -u -g -h /path/to/access_log

Program Usage

Overview

Usage: lorg [-i input_type] [-o output_type] [-d detect_mode]
            [-a add_vector] [-c client_ident] [-b dnsbl_type]
            [-q quantification] [-t threshold] [-v verbosity]
            [-n] [-u] [-h] [-g] [-p] input_file [output_file]

 -i allowed input formats: common combined vhost logio cookie
 -o allowed output formats: html json xml csv
 -d allowed detect modes: chars phpids mcshmm dnsbl geoip all
 -a additional attack vectors: path argnames cookie agent all
 -c allowed client identfiers: host session user logname all
 -b allowed dnsbl types: tor proxy zombie spam dialup all
 -q allowed quantification types: status bytes replay all
 -t threshold level as value from 0 to n (default: 10)
 -v verbosity level as value from 0 to 3 (default: 1)
 -n do not summarize results, output single incidents
 -u urldecode encoded requests (affects reports only)
 -h try to convert numerical addresses into hostnames
 -g enable geotargeting (separate files are needed!)
 -p perform a naive tamper detection test on logfile

Input formats

The input logfile format can be set with the -i parameter. Currently supported formats are:

  • common: NCSA Common Log Format (CLF)
   %h %l %u %t \"%r\" %>s %b
  • combined: common + referer + user agent
   %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
  • vhost: combined + canonical server name
   %v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
  • logio: combined + mod_logio bytes-sent/recv
   %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O
  • cookie: combined + contents of cookie(s)
   %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"

Additional, custom formats can be easily defined in Apache mod_log_config-style syntax in $allowed_input_types. If no input logfile format is given, LORG will try to 'guess' the format by matching against corresponding regular expressions.
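The format guessing described above can be sketched as follows. This is a minimal illustration in Python, not LORG's actual PHP code: the regular expressions, the names COMMON, COMBINED and guess_format, and the restriction to two formats are all assumptions made for the example.

```python
import re

# Illustrative regular expressions for two of the formats listed above.
# These are assumptions for this sketch; LORG's own expressions may differ.
# Note: requests containing escaped quotes (\") are not handled here.
COMMON = (
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)
COMBINED = COMMON + r' "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'

# Most specific format first, so a combined line is not reported as common.
CANDIDATES = [
    ("combined", re.compile("^" + COMBINED + "$")),
    ("common", re.compile("^" + COMMON + "$")),
]

def guess_format(line):
    """Return the name of the first format whose regex matches, else None."""
    stripped = line.rstrip("\n")
    for name, pattern in CANDIDATES:
        if pattern.match(stripped):
            return name
    return None
```

Trying the more specific expression first matters because every combined line also begins with a valid common line.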

Output formats

The output format for reports can be set with the -o parameter. Currently supported formats are:

  • html: HTML-file, report is realized as HTML table (with graphics and fancy additional information, if summarization is enabled)
  • json: JSON-file, can be read by the SIMILE exhibit widget, if saved as simile/exhibit/*/results.xml (see simile/README)
  • xml: XML-file, can be read by the SIMILE timeline widget, if saved as simile/timeline/results.xml (see simile/README)
  • csv: CSV file, separated by ; with " as text delimiter; can be processed by spreadsheet programs like Gnumeric or LibreOffice Calc

If no output format is given, html will be used as default.
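For post-processing outside a spreadsheet, a csv report can be read with a few lines of Python. The sketch below only relies on the separator (;) and text delimiter (") stated above; the function name read_report and the sample rows are made up for illustration, and the columns of a real LORG report will differ.

```python
import csv
import io

def read_report(handle):
    """Yield each row of a ';'-separated, '"'-quoted report as a list of strings."""
    reader = csv.reader(handle, delimiter=";", quotechar='"')
    for row in reader:
        yield row

# Made-up sample data standing in for a report file opened with open(...).
sample = io.StringIO('"client";"request"\n"203.0.113.5";"/include.php?file=a;b"\n')
rows = list(read_report(sample))
```

Because fields are quoted, a ; inside a logged request does not split the field.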

Detect modes

The attack detection mode can be set with the -d parameter. Currently implemented modes are:

PHPIDS

Documentation missing, please refer to this PDF for the moment.

use case: always a good choice since signatures/regex are a very robust approach...

CHARS

Documentation missing, please refer to this PDF for the moment.

use case: some amount of (non-poisoned) training data available and low character distribution variance in the legitimate dataset

MCSHMM

Documentation missing, please refer to this PDF for the moment.

use case: high quantities of (non-poisoned) training data available

GEOIP

Documentation missing, please refer to this PDF for the moment.

use case: use only if legitimate access is restricted to certain geographical areas/hotspots

DNSBL

Documentation missing, please refer to this PDF for the moment.

use case: use only if we have a really good DNS blackhole list already running

Attack vectors

Given a request like /path?argname=value, LORG by default scans only the values of URL query strings (e.g. /include.php?file=../../etc/passwd), as they usually contain the attacks. This keeps processing fast and limits the number of false positives. However, additional attack vectors can be defined with the -a parameter. Currently supported vectors are:

  • path: URL query path (e.g. /scripts/..%c1%9c../winnt/system32/cmd.exe?/c+dir)
  • argnames: URL query string field (e.g. user=0&+union+all+select+1,@@VERSION--+=1)
  • cookie: content of cookies, if available (e.g. template=<script>alert(1);</script>)
  • agent: user agent string, if available (e.g. Mozilla/5.0 (<?php system("id"); ?>))
  • all: check all
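To make the difference between the default vector (values) and the path and argnames vectors concrete, the following Python sketch splits a request URI into these parts. It is an illustration only, not how LORG itself parses requests; split_vectors is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qsl

def split_vectors(request_uri):
    """Split a request URI into path, argument names and argument values."""
    parts = urlsplit(request_uri)
    # parse_qsl also percent-decodes names and values and turns '+' into a space.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "path": parts.path,
        "argnames": [name for name, _ in pairs],
        "values": [value for _, value in pairs],
    }

vectors = split_vectors("/include.php?file=../../etc/passwd")
```

Here vectors["values"] holds what is scanned by default, while vectors["path"] and vectors["argnames"] correspond to the additional path and argnames vectors.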

Client identifiers

In many cases, we do not have to deal with a single attack, but with a whole series of incidents stemming from the same origin. The adequate feature to identify clients can be chosen with the -c parameter. Currently supported identifiers are:

  • host: This is the Remote-Host field found in Apache logfiles. It contains the IP address or hostname of the network subscriber initially sending the request. Using host as client identifier has proven to be a good choice in most scenarios. Problems arise if several clients access a website over the same proxy or if NAT is used. In such cases we cannot distinguish visitors by their IP address/hostname alone.
  • logname: This is the Remote-Logname field found in Apache logfiles. It contains the result of identification protocol (ident) queries as defined in [Joh93], which is the name of the user who ran the corresponding TCP connection. Since ident is rarely used on today’s Internet, and as the ident response can be arbitrarily set by the remote system, using logname as client identifier is not advised.
  • user: This is the Remote-User field found in Apache logfiles. It contains the user name after successful HTTP authentication as defined in section 11 of [BLFF96]. Unfortunately, modern web applications bring their own various mechanisms of authentication instead of applying HTTP authentication. Using user as client identifier is therefore not advised, unless the web server requires HTTP authentication.
  • session: Session IDs embedded in cookies or GET/POST queries are the typical identification attribute used by web applications. LORG tries to retrieve session ID tokens from logged requests and, if present, cookie information. Depending on the web application and on the server-side scripting language in use, the token to identify a session ID might differ. The following common (case-insensitive) keywords to search for are defined in $session_identifiers and can be adapted to your needs: SID, SESSID, PHPSESSID, JSESSIONID and ASP.NET_SessionId.

If no client identifier is given, host will be used as default.
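The session identifier lookup can be pictured roughly like this. The Python sketch below uses the keyword list given above, but the matching rules (which separators count, cookie checked before request) and the name find_session_id are assumptions for illustration, not LORG's actual logic.

```python
import re

# Keywords as listed above; matched case-insensitively.
SESSION_KEYWORDS = ["SID", "SESSID", "PHPSESSID", "JSESSIONID", "ASP.NET_SessionId"]

# A keyword must start the string or follow a separator, so 'PHPSESSID=...'
# is matched as PHPSESSID and never as its suffix SID.
_PATTERN = re.compile(
    r"(?:^|[?&;\s])(?:"
    + "|".join(re.escape(keyword) for keyword in SESSION_KEYWORDS)
    + r")=([^&;\s]+)",
    re.IGNORECASE,
)

def find_session_id(request, cookie=""):
    """Return the first session token found in the cookie or request, else None."""
    for source in (cookie, request):
        match = _PATTERN.search(source)
        if match:
            return match.group(1)
    return None
```

For example, find_session_id("/index.php?page=1&PHPSESSID=abc123") yields the token abc123, which could then serve as the client identifier instead of the host.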

Attack quantification

-q allowed quantification types: status bytes replay all

Documentation missing, please refer to this PDF for the moment.

Threshold

-t threshold level as value from 0 to n (default: 10)

Documentation missing, please refer to this PDF for the moment.

(Note: the threshold value set here is also the result in DNSBL detection mode)

Verbosity level

-v verbosity level as value from 0 to 3 (default: 1)

Documentation missing, please refer to this PDF for the moment.

Summarization

-n do not summarize results, output single incidents

Documentation missing, please refer to this PDF for the moment.

URL-decoding

-u urldecode encoded requests (affects reports only)

Documentation missing, please refer to this PDF for the moment.
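Until this section is documented, the effect of URL-decoding on a reported request can be illustrated with a small Python sketch. It assumes PHP-style urldecode semantics (percent-escapes decoded, + turned into a space), which is what Python's unquote_plus does; decode_request is a hypothetical name and this is not LORG's own code.

```python
from urllib.parse import unquote_plus

def decode_request(raw):
    """Decode %XX escapes and turn '+' into spaces for display purposes."""
    return unquote_plus(raw)

decoded = decode_request("/include.php?file=..%2F..%2Fetc%2Fpasswd")
```

In this example decoded is "/include.php?file=../../etc/passwd", which is far easier to read in a report than the encoded form.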

DNS lookups

-h try to convert numerical addresses into hostnames

Documentation missing, please refer to this PDF for the moment.

DNSBL lookups

An attacker might try to hide his identity behind open proxies, botnets or other kinds of middleman nodes. Noticing this kind of obfuscation is important, for instance if legal action is being considered; mistakenly blaming the operator of an anonymizing service should be avoided. The concept of a Real-time Blackhole List (RBL) was invented by militant anti-spam activist Paul Vixie back in 1997 and later combined with the Domain Name System by Eric Ziegast. DNS-based blackhole lists (DNSBL) as documented in [Lev10] offer a way to detect whether an IP address has already been ‘conspicuous’ in the past, e.g. as a source of spam. The concept is simple: a DNS request for the questionable IP address is made to a DNSBL provider; if it resolves (A record), the address is blacklisted there. Although DNSBL was originally introduced to identify e-mail spammers, there are blackhole lists for various purposes nowadays. For example, one can ask the list tor.dnsbl.sectoor.de whether a client is routed through a node of the Tor anonymizing network. Similar lists exist for open proxies (HTTP, SOCKS, etc.), for remote-controlled, trojan-infected computers (so-called ‘zombies’, often acting as part of a botnet) and for dialup and dynamic IP ranges.

DNSBL lookups of hosts considered malicious can be enabled, and the type of DNS blackhole list chosen, with the -b parameter. Currently supported lists are:

  • tor: tor.dnsbl.sectoor.de
  • proxy: dnsbl.proxybl.org, http.dnsbl.sorbs.net, socks.dnsbl.sorbs.net
  • zombie: xbl.spamhaus.org, zombie.dnsbl.sorbs.net
  • spam: b.barracudacentral.org, spam.dnsbl.sorbs.net, sbl.spamhaus.org
  • dialup: dyn.nszones.com
  • all: check all types of blacklists
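A single DNSBL check as described above can be sketched in Python. By DNSBL convention, the IPv4 octets are reversed and prepended to the list's zone before the A-record query is made. The helper names dnsbl_query_name and is_listed are assumptions for this example, and the sketch covers IPv4 only; it is not LORG's implementation.

```python
import socket

def dnsbl_query_name(ipv4, zone):
    """Build the DNS name to query: reversed IPv4 octets followed by the zone."""
    octets = ipv4.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ipv4)
    return ".".join(reversed(octets)) + "." + zone

def is_listed(ipv4, zone):
    """Return True if the query name resolves to an A record in the given zone."""
    try:
        socket.gethostbyname(dnsbl_query_name(ipv4, zone))
        return True
    except socket.gaierror:
        # No A record. Note that a resolver or network failure ends up here
        # as well, so False does not prove the address is clean.
        return False

name = dnsbl_query_name("192.0.2.99", "tor.dnsbl.sectoor.de")
```

Each lookup is a network round trip, which is why DNSBL checks can slow down processing noticeably when many hosts are involved.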

Geotargeting

Geotargeting of attackers can be enabled with the -g parameter. This might be interesting if the origin of malicious requests should be tracked to a certain country or city. To integrate geotargeting, LORG uses the publicly available (CC) database GeoLite City by Maxmind, which can be found in the geoip/ directory. Since GeoIP lookups are done locally, the performance impact is almost negligible.

Tamper test

After breaking into a computer system, the attacker will likely try to scrub the log files to cover her tracks. However, untampered log data is necessary for any post-attack forensics. Therefore a simple anomaly check against the input logfile, which may at least detect crude tampering, can be run using the -p parameter. It can identify a ‘loss’ of data within the logfile by searching for overly long time slots with no activity at all. This is done by a one-sided Grubbs’ outlier test on the logfile's inter-request time delays. Keep in mind that outliers can be caused by normal phenomena (temporarily disabled log services, network downtimes, clock changes, no company access at Christmas etc.), which is why this naive ‘completeness test’ is error-prone. Also, it will only detect large-scale truncations within the log data.
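The idea behind the test can be sketched in Python as follows. The sketch computes the one-sided Grubbs statistic G = (max - mean) / s on the delays between consecutive requests; the critical value to compare against depends on sample size and significance level and is left to the caller here (for example taken from a published table), which is a simplification. All function names are hypothetical, and this is not LORG's own code.

```python
import statistics

def inter_request_delays(timestamps):
    """Return the gaps (in seconds) between consecutive request timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def grubbs_max_statistic(delays):
    """Return G = (max - mean) / s, the one-sided Grubbs statistic for the largest delay."""
    if len(delays) < 3:
        raise ValueError("need at least three delays")
    deviation = statistics.stdev(delays)  # sample standard deviation
    if deviation == 0:
        return 0.0  # all delays identical, nothing stands out
    return (max(delays) - statistics.mean(delays)) / deviation

def has_suspicious_gap(timestamps, critical_value):
    """Flag the log if the longest gap exceeds the caller-supplied critical value."""
    return grubbs_max_statistic(inter_request_delays(timestamps)) > critical_value

# Five requests one second apart, then a long silence before the last one.
g = grubbs_max_statistic(inter_request_delays([0, 1, 2, 3, 4, 100]))
```

A large G only says that one gap is unusually long compared to the others; as noted above, such gaps can have harmless causes.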

Input file

The name of the input logfile to parse. This argument MUST be given.

Output file

An output file name for the report generated by LORG can be given as the very last argument. It SHOULD have a file extension corresponding to the output file format (e.g. html). If no output file name is given, a file named report_input-file_date.format will be written within the current working directory.

FAQ

Q: What does LORG stand for?
A: Emm... 'Logfile Outlier Recognition and Gathering'. Also it is an old Irish word ([ˈl̪ˠɔɾˠə]) for trace, track, trail.

Q: Why would anyone call a program LORG?
A: Unfortunately all the other cool names had already been taken.

Q: What is LORG all about?
A: It's a PHP-CLI program that implements various detection techniques to automatically scan your HTTPD logfiles for attacks against web applications.

Q: A CLI program? Why in hell use PHP?
A: At the beginning it seemed a good idea, because PHPIDS could be easily integrated. As time went by things got bigger than expected...

Q: Will it prevent my web applications from getting hacked?
A: Not at all! LORG is designed to detect intrusion attempts within the web server's logfiles (ergo after the incident has already happened). If you need intrusion prevention on the web application level, try mod_security.

Q: What logfile formats are supported?
A: Out of the box, common, combined (Apache, nginx) and some other formats are supported. All mod_log_config-compatible formats like 'custom' => '%h %l %u %t \"%r\" %>s %b %{X-Forwarded-For}i' will do, if defined in $allowed_input_types in the code.

Q: What about W3C-extended (IIS) log file formats?
A: Convert, using e.g. rconvlog

Q: Shouldn't we have a look at Apache's error_log files, in addition to access_logs?
A: From a forensics perspective: absolutely! At the moment, automated parsing and interpretation of error_log files is not implemented; however, this might become a feature in the future.

Q: Attacks carried out via HTTP POST remain undetected as they are not logged by Apache! How to log POST data?
A: The lack of POST data is a major drawback for post-attack forensics. You can either separately log POST data with mod_dumpost/mod_dumpio/mod_security or gently ask the Apache developers to introduce a new mod_log_config format string for this.

Q: How to analyze various separate logfiles at once?
A: If you have several access.log.*.gz files, try something like gunzip access.log.*.gz && cat access.log.* > merged.log or use a tool like logmerge.
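If you would rather leave the compressed files untouched, the same merge can be done with a short Python script. This is only a sketch: merge_logs is a hypothetical helper, it simply concatenates the files in the order given (like cat), and it does not sort lines by timestamp.

```python
import gzip
import shutil

def merge_logs(paths, merged_path):
    """Concatenate plain and gzip-compressed logfiles into one file, in the given order."""
    with open(merged_path, "wb") as merged:
        for path in paths:
            # Pick the opener by extension; .gz files are decompressed on the fly.
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rb") as source:
                shutil.copyfileobj(source, merged)
```

A call such as merge_logs(sorted(glob.glob("access.log.*")), "merged.log") (after import glob) mirrors the shell one-liner; check that the sorted order matches your log rotation scheme before feeding the result to LORG.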

Q: How to whitelist/exclude certain noisy clients (e.g. legitimate pentesting security scanners) from detection?
A: grep -v is your friend.

Q: Why does the urldecode switch (-u) have no effect on detection results?
A: The -u switch only affects visualization (= output file). For detection, all HTTP requests are automatically url-decoded before processing.

Q: Is it possible to output all incidents, including harmless ones?
A: Yes. Use -t 0.

Q: How fast is LORG?
A: LORG's performance primarily depends on the selected detect mode (-d). While the 'chars' mode performs acceptably (about 50,000 loglines per minute), advanced learning algorithms like 'mcshmm' will take much longer (while being more accurate). To speed up processing, try to set $only_check_webapps = true and do not use any additional attack vectors (-a) or DNS/DNSBL (-h, -b) lookups, as they can be performance killers.

Q: How much memory does LORG require?
A: LORG was written with low memory in mind. Depending on the selected detection modes it might still require up to the size of the processed logfile in the worst case. If summarization is disabled (-n), LORG should not require more than 4MB of memory as all loglines are parsed, analyzed and directly written to the output file without caching.

Q: My logfiles are b0ring. Where do I get some more interesting datasets of real-world attacks to do some research on?
A: It's difficult to get publicly available HTTP attack datasets. If you know any, let me know. The author of LORG is aware of Honeynet Project's Scan31 and the CDX 2011 dataset. If you're bored to death, you might just want to ask Google.

Q: What other software is out there, to detect attacks within HTTPD logfiles?
A: The author of LORG is aware of the following tools: webforensik, py-scalp, ida, InterScout, hmm-web, detect-http-attack, scan_log.py, hack-attempt-identifier and httpdwatch. If you know any more programs designed for the job, please let me know.

Q: Will it work on Windows?
A: No.

Known Issues

  • To make the PHPIDS detection mode work, a symlink IDS -> . needs to be created in the phpids/ folder
  • Geotargeting is incompatible with the php5-geoip Debian/Ubuntu package
  • When showing the results within a SIMILE widget (worldmap or timeline), your web browser might hang if several thousand or more results are to be displayed
  • And many, many more...