Awk for Apache / Nginx logs
Latest commit 595d546 Sep 7, 2015 @lapsedtheorist Added new awk filter to collect anonymised user-agent data
For user-agent analysis of sites where the IP addresses and request
patterns may be privileged information, anonymised UA data must be
obtained. The file ua.awk collates the date/time and user-agent data for
every request and includes basic information on the request type and
response status code. The latter should help in determining if
particular user agents have trouble accessing certain services for some

Web server log file analysis & filtering

v1.2; Oct 2012
Ben Carpenter

This awk script processes lines in the 'combined' log format often used by the Apache and Nginx web servers. If your log format is different, amend the script accordingly; for reference, this is the combined format the script expects by default:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

%h      Remote host
%l      Remote logname (ignored)
%u      Remote user (ignored)
%t      Date and time of the request
%r      First line of the request, typically "GET /something HTTP/1.1"
%>s     Status
%b      Size of response in bytes
%{Referer}i     Referer request header
%{User-agent}i  User-Agent request header
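
To make the field layout concrete, here is a minimal sketch of one common way to pull these fields out of a combined-format line with awk: split the line on double quotes first, then split the unquoted pieces on whitespace. The function name parse_combined and the sample values are hypothetical, this is not taken from the script in this repository, and the approach assumes no field contains an escaped double quote.

```shell
# Hypothetical helper: print the fields of each combined-format line on stdin.
parse_combined() {
  awk -F'"' '
    {
      # $1 is everything before the first double quote:
      # host, logname, user, then the bracketed date and zone
      split($1, head, " ")
      # Strip the leading "[" and trailing "]" from the two timestamp pieces
      stamp = substr(head[4], 2) " " substr(head[5], 1, length(head[5]) - 1)
      # $3 sits between the request and the referer: status and size
      split($3, tail, " ")
      print "host=" head[1]
      print "time=" stamp
      print "request=" $2
      print "status=" tail[1]
      print "bytes=" tail[2]
      print "referer=" $4
      print "agent=" $6
    }'
}

# Example with a made-up log line
printf '%s\n' '203.0.113.7 - - [07/Sep/2015:10:15:32 +0100] "GET /index.html HTTP/1.1" 200 5120 "http://example.com/" "Mozilla/5.0 (X11; Linux x86_64)"' | parse_combined
```

Splitting on the quote character keeps the request line, referer and user agent intact even though they contain spaces, which a plain whitespace split would not.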

It tries to be light on resources, so there are minimal progress messages and no system commands in the main loop, other than writing to a file based on the status code. The output files are written in a simplified tab-separated format, with corrections for some oddities such as spaces in URLs and double quotes in the userid. This revised format is easier to pass reliably through other awk scripts when filtering for specific data. The file format is:

IP, Date/Time, Method, URL, Status, Size, Referer, User Agent
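
As an illustration of how this format can be filtered downstream, the sketch below counts requests per URL for a given status code. It assumes the eight columns are in the order listed above, so the URL is column 4 and the status is column 5. The function name urls_with_status and the file name in the usage line are hypothetical.

```shell
# Hypothetical helper: count hits per URL for one status code.
# $1 is the status code; the tab-separated format is read from stdin.
urls_with_status() {
  awk -F'\t' -v want="$1" '
    # Column 5 is the status, column 4 is the URL
    $5 == want { hits[$4]++ }
    END { for (url in hits) print hits[url] "\t" url }
  ' | sort -rn
}

# Usage (the file name is an assumption):
#   urls_with_status 404 < 404.tsv
```

Because every field is tab-delimited, the filter needs no quoting logic: a user agent or URL containing spaces still occupies exactly one column.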

You should be able to send a large (>1GB) amount of log data through this script quite comfortably. This works well for me, but the usual caveats apply (use it at your own risk, etc.). Bug reports and suggestions for improvements are very welcome.