<html>
<head>
<title>Visitors - fast web log analyzer</title>
<link rel=stylesheet href="visitors.css">
</head>
<body>
<center>
<table border="0" width="60%" cellpadding="0" cellspacing="0">
<tr><td align="center"><img width="247" height="56" src="visitors.png" alt="VISITORS" title="Visitors web log analyzer"></td></tr><tr><td align="center"></td></tr>
<tr>
<td align="center" class="maintitle">Visitors, online documentation for 0.7</td>
</tr>
<tr>
<td align="center"><a href="http://www.hping.org/visitors">(click here for the home page)</a></td>
</tr>
<tr>
<td>
<A HREF="#toc">Table of Contents</A><P>
<H2><A NAME="sect0" HREF="#toc0">Name</A></H2>
visitors - a fast web server log analyzer
<H2><A NAME="sect1" HREF="#toc1">Synopsis</A></H2>
<B>visitors</B> [<I>options</I>]
<I>&lt;filename&gt;</I> [<I>&lt;filename&gt;</I> ...]
<H2><A NAME="sect2" HREF="#toc2">Description</A></H2>
<P>
<I>Visitors</I> generates access statistics from
specified web log files. <P>
The resulting reports contain a range of useful
information and statistics:
<DL>
<DD>Requested pages </DD>
<DD>Requested images </DD>
<DD>Referers
by number of visits and age </DD>
<DD>Unique visitors in each day </DD>
<DD>Page views
per visit </DD>
<DD>Pages accessed by the Google crawler (and the date of Google's
last access to every page) </DD>
<DD>Pages accessed by the AdSense crawler (and
the date of AdSense's last access to every page) </DD>
<DD>Percentage of visits
originated from Google searches for every day </DD>
<DD>User navigation patterns
(web trails) </DD>
<DD>Keyphrases used in Google searches </DD>
<DD>Human languages
used in Google searches </DD>
<DD>User agents </DD>
<DD>Weekday and hour distributions
of accesses </DD>
<DD>Weekdays/Hours combined bidimensional map </DD>
<DD>Month/Day
combined bidimensional map </DD>
<DD>Visual path analysis with Graphviz </DD>
<DD>Operating
systems, browsers and domains popularity </DD>
<DD>Visitors' screen resolution
and color depth </DD>
<DD>404 errors </DD>
</DL>
<P>
The web log files don't need to follow a
strict format, with three exceptions: the date MUST be enclosed between [ and ] characters,
the client hostname MUST be the first entry on each log line, and referers and requests
MUST be enclosed in double quotes. Apache log files
work out of the box. <P>
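As a rough illustration of the three format rules above, the following Python sketch pulls the hostname, the date and the quoted fields out of one log line. The helper name split_log_line and the sample line are assumptions made for this example; this is not the parser used by <I>Visitors</I>.
<PRE>
```python
import re

# Hypothetical helper, not part of Visitors: it only mirrors the three
# format rules stated in the documentation.
def split_log_line(line):
    """Return (host, date, quoted_fields), or None if a rule is not met."""
    host = line.split(" ", 1)[0]                # hostname is the first entry
    date = re.search(r"\[([^\]]*)\]", line)     # date sits between [ and ]
    quoted = re.findall(r'"([^"]*)"', line)     # request, referer, ... in quotes
    if not host or date is None or not quoted:
        return None
    return host, date.group(1), quoted

# Made-up sample line in the usual Apache combined layout.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')
print(split_log_line(sample))
```
</PRE>
<P>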
Visitors can also be used with IIS log files
by converting them with the iis2apache.pl utility distributed with Visitors
(the same utility is available at <A HREF="http://www.jammed.com/~jwa/hacks/">http://www.jammed.com/~jwa/hacks/</A>
and is distributed under the GPL license). <P>
Note that a log file name can be the -
character, which makes <I>Visitors</I> read from standard input. <P>
<H3><A NAME="sect3" HREF="#toc3">Available options:</A></H3>
<DL>
<DT><B>-A --all</B> </DT>
<DD>Activate all
the optional reports. This option is equivalent to <B>-GKUWRDOB.</B> Note that <B>--trails</B>
is not implicitly included in this option because it also requires <B>--prefix.</B>
See the <B>--trails</B> option documentation for details. </DD>
</DL>
<P>
<DL>
<DT><B>-T --trails</B> </DT>
<DD>Enable the Web
Trails feature. The report shows the most frequent moves between
pages of your site. This option requires the <B>--prefix</B> option to work. </DD>
</DL>
<P>
<DL>
<DT><B>-G --google</B>
</DT>
<DD>Activate two reports about pages accessed by the Google and AdSense web
crawlers. Pages are shown ordered according to the last time the Google
web crawler requested the page. The first page shown is the one that
was accessed most recently. </DD>
</DL>
<P>
<DL>
<DT><B>-K --google-keyphrases</B> </DT>
<DD>Activate a report that shows common search
keyphrases used to find your web site from Google. </DD>
</DL>
<P>
<DL>
<DT><B>-Z --google-keyphrases-age</B>
</DT>
<DD>Activate a report that shows the latest keyphrases used to find
your site from Google. </DD>
</DL>
<P>
<DL>
<DT><B>-H --google-human-language</B> </DT>
<DD>Activate a report that shows
common human languages used to search from Google. This feature uses the
'hl' variable of the Google referer URL. </DD>
</DL>
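<P>
The 'hl' lookup mentioned above can be pictured with a short Python sketch built on the standard urllib.parse module. The function name google_language and the example referer are assumptions for illustration; the actual extraction code in <I>Visitors</I> may differ.
<PRE>
```python
from urllib.parse import urlencode, urlparse, parse_qs

def google_language(referer):
    """Return the value of the 'hl' query variable, or None if absent."""
    query = parse_qs(urlparse(referer).query)
    values = query.get("hl")
    return values[0] if values else None

# Build an example referer; the search terms are made up for illustration.
referer = "http://www.google.com/search?" + urlencode(
    {"q": "web log analyzer", "hl": "it"})
print(google_language(referer))
```
</PRE>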
<P>
<DL>
<DT><B>-U --user-agents</B> </DT>
<DD>Show information about
common user agents. </DD>
</DL>
<P>
<DL>
<DT><B>-W --weekday-hour-map</B> </DT>
<DD>Activate the generation of a combined
weekdays/hours bidimensional map that shows the traffic in
each of the 168 hours of a 7-day week. Brighter colors mean higher
traffic. This is ideal for working out the best moment in the week for a maintenance
downtime, who the audience of the site is, whether people are accessing it from
work or from home, and so on. The map is generated as pure HTML inside the
report. </DD>
</DL>
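<P>
The 168 slots come from 7 days times 24 hours. A minimal Python sketch of how accesses could be counted per slot follows; the names slot_index and counts, the choice of Monday 00:00 as slot 0, and the timestamps are all assumptions made for this example.
<PRE>
```python
from datetime import datetime

HOURS_PER_DAY = 24
SLOTS = 7 * HOURS_PER_DAY   # 168 hour slots in a 7-day week

def slot_index(moment):
    """Map a datetime to its slot; Monday 00:00 is slot 0 in this sketch."""
    return moment.weekday() * HOURS_PER_DAY + moment.hour

# Count three made-up accesses into the weekly map.
counts = [0] * SLOTS
for stamp in ("2005-10-10 13:55", "2005-10-10 13:58", "2005-10-16 23:01"):
    counts[slot_index(datetime.strptime(stamp, "%Y-%m-%d %H:%M"))] += 1
```
</PRE>
A renderer would then map each count to a color, with brighter colors for higher values.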
<P>
<DL>
<DT><B>-M --month-day-map</B> </DT>
<DD>Activate the generation of a combined month/day bidimensional
map that shows the traffic on each of the 365 days of
the year. Brighter colors mean higher traffic. This is useful for
spotting at a glance traffic trends and days with particularly high
or low traffic. The map is generated as pure HTML inside the report. </DD>
</DL>
<P>
<DL>
<DT><B>-R --referers-age</B>
</DT>
<DD>Show referers ordered by age. The 'age' of a referer is the date it first
appeared. In the report, newer referers are on top. This report is
useful for checking for new external links. </DD>
</DL>
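<P>
The ordering described above can be sketched in Python as follows. The function name referers_by_age, the input shape and the sample data are assumptions for this example, not the data structures used by <I>Visitors</I>.
<PRE>
```python
from datetime import date

def referers_by_age(hits):
    """hits: iterable of (date, referer). Return referers, newest first."""
    first_seen = {}
    for day, referer in hits:
        # The 'age' of a referer is the earliest date on which it appeared.
        first_seen[referer] = min(day, first_seen.get(referer, day))
    return sorted(first_seen, key=first_seen.get, reverse=True)

# Made-up hits: a.example first appears on Oct 1, b.example on Oct 5.
hits = [
    (date(2005, 10, 1), "http://a.example/"),
    (date(2005, 10, 5), "http://b.example/"),
    (date(2005, 10, 7), "http://a.example/"),
]
print(referers_by_age(hits))
```
</PRE>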
<P>
<DL>
<DT><B>-D --domains</B> </DT>
<DD>Activate the generation
of information about the popularity of Top Level Domains. This information may
be useful to estimate the number of visits from different countries. Note that
Visitors will not resolve numerical IP addresses if they are not already
resolved in the log file. All the unresolved IP addresses will be shown
in this report under the entry Unresolved IP. </DD>
</DL>
<P>
<DL>
<DT><B>-O --operating-systems</B> </DT>
<DD>Activate
the report about the popularity of operating systems, sorted by number of accesses.
All the common operating systems are listed in the report, while unknown
operating systems are summed under the unknown entry. </DD>
</DL>
<P>
<DL>
<DT><B>-B --browsers</B> </DT>
<DD>Activate
the report about the popularity of browsers, sorted by number of accesses. All
the common browsers are listed in the report, while unknown browsers
are summed under the unknown entry. Browsers are listed by family (for example
Internet Explorer, Opera, and so on), and not by specific version. </DD>
</DL>
<P>
<DL>
<DT><B>-X --error404</B>
</DT>
<DD>Activate the generation of the missing documents (404 error) report. This report
shows files that were requested but are missing, ordered by number of requests. The
report is useful for discovering files that are missing from the web site
by mistake, but often you will also see bizarre requests performed
by users, internet worms, and security scans. </DD>
</DL>
<P>
<DL>
<DT><B>-Y --pageviews</B> </DT>
<DD>Activate the
generation of a report that shows (an approximation of) the percentage
of pages viewed per unique visit. The goal of this report is to understand
the usage pattern of the site and the level of interest of the visitors.
For example, in a site that provides a number of pages with interesting
content, the visitors performing a single page view per
visit are probably searching for something else. </DD>
</DL>
<P>
<DL>
<DT><B>-S --robots</B> </DT>
<DD>Activate the generation
of a report that shows user agents of clients requesting the file robots.txt,
with the exception of the MSIE Crawler requests. The result is a list of
web robots and spiders that accessed your web site, ordered by number
of requests for robots.txt. </DD>
</DL>
<P>
<DL>
<DT><B>--screen-info</B> </DT>
<DD>Activate the screen resolution and
color depth reports. Note that for this report to work you have to insert
into your HTML pages the JavaScript code you can find in the README file
in the Visitors tarball. </DD>
</DL>
<P>
<DL>
<DT><B>--stream</B> </DT>
<DD>Enable the Stream Mode (see the <B>STREAM
MODE DETAILS</B> section for more information). In short: when in stream mode
<I>Visitors</I> will process all the log files specified (possibly none, that's
valid in this mode) as usual, producing the report. Then the stream mode
is entered and <I>Visitors</I> will start to read a continuous
stream of web logs from standard input, updating the statistics incrementally as new data becomes
available. A new report is produced periodically if new data arrived, according
to the <B>--update-every</B> option (the default is to update the statistics every ten
minutes). It's possible to ask <I>Visitors</I> to reset the statistics after some
period of time using the <B>--reset-every</B> option. This gives a snapshot
of what has happened in the last five minutes, hour, day or week. Note that
<B>--stream</B> requires <B>--output-file</B> because <I>Visitors</I> needs to overwrite the report
on every update, so it can't write to standard output as usual. If you
plan to use the stream mode, also check the <B>--tail</B> option. </DD>
</DL>
<P>
<DL>
<DT><B>--update-every</B><I> seconds</I>
</DT>
<DD>By default in Stream Mode statistics are updated every 10 minutes. This
option specifies a different period in seconds. </DD>
</DL>
<P>
<DL>
<DT><B>--reset-every</B><I> seconds</I> </DT>
<DD>By default
in Stream Mode statistics are never reset, but continuously updated incrementally.
This option makes <I>Visitors</I> reset the statistics after the given amount of time
in seconds. This is useful for getting a snapshot of recent web site usage. </DD>
</DL>
<P>
<DL>
<DT><B>-f --output-file</B><I>
file</I> </DT>
<DD>Write output to <I>file</I> instead of stdout. </DD>
</DL>
<P>
<DL>
<DT><B>-m --max-lines</B><I> number</I> </DT>
<DD>Set the max
number of entries that should be shown in reports like referers, keyphrases
and so on. This option sets the max number of entries for all
the reports at once. </DD>
</DL>
<P>
<DL>
<DT><B>-r --max-referers</B><I> number</I> </DT>
<DD>Set the max number of entries
in the referer report. </DD>
</DL>
<P>
<DL>
<DT><B>-p --max-pages</B><I> number</I> </DT>
<DD>Set the max number of entries in
the accessed pages report. </DD>
</DL>
<P>
<DL>
<DT><B>-i --max-images</B><I> number</I> </DT>
<DD>Set the max number of entries
in the accessed images report. </DD>
</DL>
<P>
<DL>
<DT><B>-x --max-error404</B><I> number</I> </DT>
<DD>Set the max number of
entries in the missing documents report. </DD>
</DL>
<P>
<DL>
<DT><B>-u --max-useragents</B><I> number</I> </DT>
<DD>Set the
max number of entries in the user agents report. </DD>
</DL>
<P>
<DL>
<DT><B>-t --max-trails</B><I> number</I> </DT>
<DD>Set
the max number of entries in the web trails report. </DD>
</DL>
<P>
<DL>
<DT><B>-g --max-googled</B><I> number</I>
</DT>
<DD>Set the max number of entries in the crawled pages report (google bot).
</DD>
</DL>
<P>
<DL>
<DT><B>--max-adsensed</B><I> number</I> </DT>
<DD>Set the max number of entries in the crawled pages
report (adsense bot). </DD>
</DL>
<P>
<DL>
<DT><B>-k --max-google-keyphrases</B><I> number</I> </DT>
<DD>Set the max number of
entries in the Google keyphrases report. </DD>
</DL>
<P>
<DL>
<DT><B>-a --max-referers-age</B><I> number</I> </DT>
<DD>Set the
max number of entries in the referers by date report. </DD>
</DL>
<P>
<DL>
<DT><B>-d --max-domains</B><I> number</I>
</DT>
<DD>Set the max number of entries in the domains report. </DD>
</DL>
<P>
<DL>
<DT><B>-P --prefix</B><I> prefix</I> </DT>
<DD>A prefix
tells <I>Visitors</I> what a link must look like to be classified as internal
to your site. This option is required for <B>--trails</B>, and it also keeps
internal links out of the referers report.
If you are analyzing statistics for <A HREF="http://www.your.site.com/">http://www.your.site.com/</A>,
just use: <B>--prefix <A HREF="http://www.your.site.com">http://www.your.site.com</A></B>
<P>
If your site is reachable under more than one hostname, you
should specify all of them, as in the following example: <BR>
<B>--prefix <A HREF="http://www.your.site.com">http://www.your.site.com</A>
--prefix http://your.site.com</B> </DD>
</DL>
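<P>
To make the idea of prefixes concrete, here is a minimal Python sketch that classifies referers as internal or external. It assumes a plain, case sensitive starts-with comparison, which is a guess about how <I>Visitors</I> applies prefixes; the function name is_internal and the sample referers are hypothetical.
<PRE>
```python
def is_internal(referer, prefixes):
    """True when the referer starts with one of the configured prefixes."""
    # Assumption: a simple starts-with test, as the word 'prefix' suggests.
    return referer.startswith(tuple(prefixes))

prefixes = ["http://www.your.site.com", "http://your.site.com"]
referers = [
    "http://your.site.com/page.html",
    "http://www.google.com/search",
    "http://www.your.site.com/",
]
# Only external referers would remain in the referers report.
external = [r for r in referers if not is_internal(r, prefixes)]
print(external)
```
</PRE>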
<P>
<DL>
<DT><B>-o --output</B><I> html|text</I>
</DT>
<DD>Output module. You can use text or html. The default is html. </DD>
</DL>
<P>
<DL>
<DT><B>-V --graphviz</B> </DT>
<DD>This
option enables the Graphviz mode: <I>Visitors</I> will analyze the log file and
create a graph describing the access patterns of your web site. The information
used to create the graph is the same as in the web trails report (which you
can enable with --trails), but as a graph it can be more readable for
non-trivial sites. An example of how to use this feature: <P>
% visitors access.log
--prefix <A HREF="http://www.hping.org">http://www.hping.org</A>
--graphviz &gt; graph.dot<BR>
<P>
% dot graph.dot -Tpng &gt; graph.png <P>
On Debian systems, the <B>dot</B> command
is included in the <B>graphviz</B> package. The generated graph will have edges
of different colors, from blue to red, to indicate a low to high level of
popularity of a given movement from one page of the web site to another.
This option requires one or more <B>--prefix</B> options in order to work, just
like the <B>--trails</B> option. </DD>
</DL>
<P>
<DL>
<DT><B>-V --graphviz-ignorenode-google</B> </DT>
<DD>Don't put the Google node
on the generated graph. Only useful with <B>--trails</B> </DD>
</DL>
<P>
<DL>
<DT><B>-V --graphviz-ignorenode-external</B>
</DT>
<DD>Don't put the external referer node on the generated graph. Only useful with
<B>--trails</B> </DD>
</DL>
<P>
<DL>
<DT><B>-V --graphviz-ignorenode-noreferer</B> </DT>
<DD>Don't put the node indicating requests
without referer on the generated graph. Only useful with <B>--trails</B> </DD>
</DL>
<P>
<DL>
<DT><B>--tail</B> </DT>
<DD>When
this option is specified <I>Visitors</I> will emulate the Unix command tail -f
--max-unchanged-stats=1 -q. You can specify the names of the log files to monitor for
changes. Once new data is appended to any of the specified files, <I>Visitors</I>
will write the new data to standard output. This option is useful in conjunction
with the Stream Mode (--stream). Files can be log-rotated because <I>Visitors</I> in
Tail Mode will always try to reopen the file to check for changes. </DD>
</DL>
<P>
<DL>
<DT><B>--time-delta</B><I>
delta</I> </DT>
<DD>If your web server is in a different timezone than most of your visitors
or yourself, you will notice a shift in the reports regarding time and
days of the week. By default, <I>Visitors</I> will generate output using the host's
locale. You can use the <B>--time-delta</B> option to adjust the output. Positive
values shift times to the right (toward the future) by the given number of
hours; negative values shift them to the left (toward the past). In the future
this option may support specifying the output timezone directly. </DD>
</DL>
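<P>
The effect of a time delta on a single log timestamp can be sketched in Python with the standard datetime module. The function name shift and the sample timestamp are assumptions for this example; it only illustrates how a shift of a few hours can move an access to a different hour and weekday.
<PRE>
```python
from datetime import datetime, timedelta

def shift(moment, delta_hours):
    """Shift a timestamp by delta_hours; negative values move to the past."""
    return moment + timedelta(hours=delta_hours)

logged = datetime(2005, 10, 10, 23, 30)   # a Monday, 23:30
later = shift(logged, 2)                  # lands on Tuesday, 01:30
print(later, later.weekday())
```
</PRE>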
<P>
<DL>
<DT><B>--filter-spam</B>
</DT>
<DD>Filter referer spam using a keyword-based filter (see blacklist.h for more
information on keywords). If you don't know what referer spam is, check this
Wikipedia page: <A HREF="http://en.wikipedia.org/wiki/Referer_spam">http://en.wikipedia.org/wiki/Referer_spam</A>
</DD>
</DL>
<P>
<DL>
<DT><B>--ignore-404</B> </DT>
<DD>When
this option is turned on, log lines with 404 errors are used only to generate
the 404 errors report and are ignored by the other reports. </DD>
</DL>
<P>
<DL>
<DT><B>--grep</B><I> pattern</I> </DT>
<DD>Process
only log lines matching the specified pattern. Patterns are matched using
glob-style matching (the kind used by the Unix shell):
<blockquote>
<DL>
<DT><B>*</B></DT>
<DD>Matches any sequence
of characters in <I>string</I>, including a null string. </DD>
<DT><B>?</B></DT>
<DD>Matches any single character
in <I>string</I>. </DD>
<DT><B>[</B><I>chars</I><B>]</B></DT>
<DD>Matches any character in the set given by <I>chars</I>. If a
sequence of the form <I>x</I><B>-</B><I>y</I> appears in <I>chars</I>, then any character between <I>x</I>
and <I>y</I>, inclusive, will match. </DD>
<DT><B>\</B><I>x</I></DT>
<DD>Matches the single character <I>x</I>. This provides
a way of avoiding the special interpretation of the characters <B>*?[]\</B> in
<I>pattern</I>. </DD>
</DL>
</blockquote>
By default, matching is performed in a case sensitive way, but
case insensitive matching may be forced by prefixing the pattern with the
string <B>cs:</B>, so for example the pattern <B>cs:firefox</B> will match all the log
lines containing the string firefox, FireFox, FIREFOX and so on. </DD>
</DL>
<P>
<DL>
<DT><B>--exclude</B><I>
pattern</I> </DT>
<DD>Works exactly like <B>--grep</B>, but only lines NOT matching the specified
pattern are processed. Note that --grep and --exclude can be used multiple times,
and are processed sequentially. For example <B>visitors --grep firefox --exclude
download</B> will process only lines including the string firefox but not including
the string download. </DD>
</DL>
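<P>
The sequential processing of <B>--grep</B> and <B>--exclude</B> can be approximated in Python with the standard fnmatch module. This is only a sketch under stated assumptions: the pattern is wrapped in * so that it may match anywhere in the line (suggested by the firefox example above, but not confirmed), the function names matches and keep are hypothetical, and fnmatch does not implement the backslash escape described for <B>--grep</B>.
<PRE>
```python
from fnmatch import fnmatchcase

def matches(pattern, line):
    """Glob-match pattern anywhere in line, honouring the cs: prefix."""
    if pattern.startswith("cs:"):
        # As documented for --grep, cs: forces case insensitive matching.
        pattern, line = pattern[3:].lower(), line.lower()
    # Assumption: the pattern may match any part of the line.
    return fnmatchcase(line, "*" + pattern + "*")

def keep(line, filters):
    """filters: list of ('grep' or 'exclude', pattern), applied in order."""
    for kind, pattern in filters:
        hit = matches(pattern, line)
        # grep rejects lines that do not match; exclude rejects lines that do.
        if (kind == "grep") != hit:
            return False
    return True

filters = [("grep", "cs:firefox"), ("exclude", "download")]
print(keep('"GET /index.html HTTP/1.0" "Mozilla FireFox"', filters))
```
</PRE>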
<P>
<DL>
<DT><B>--debug</B> </DT>
<DD>Show additional information on errors. For example,
invalid lines are printed to standard error if found. Mainly useful
for developers and error reporting. </DD>
</DL>
<P>
<DL>
<DT><B>-h --help</B> </DT>
<DD>Show usage and copyright information.
</DD>
</DL>
<P>
<DL>
<DT><B>-v --version</B> </DT>
<DD>Show program version. </DD>
</DL>
<H2><A NAME="sect4" HREF="#toc4">Examples</A></H2>
For the simplest usage, when you want to check a web log
interactively (for example over ssh on your web server),
just use: <P>
% visitors --output text access.log | less <P>
That will produce human readable
output in text only. To generate HTML web stats with much more information
you may use this instead: <P>
% visitors -A -m 30 access.log -o html
&gt; report.html <P>
If you want information on the usage patterns for your site
you must provide the URL prefix of your web site and specify the <B>--trails</B>
option. The next example produces an HTML report with usage pattern information.
<P>
% visitors -A -m 30 access.log --trails --prefix <A HREF="http://www.hping.org">http://www.hping.org</A>
&gt; report.html<BR>
<P>
Note that it's ok to specify multiple file names, or to provide the input
using the standard input like in the following two examples: <P>
% visitors
/var/log/apache/access.log.* <BR>
% zcat access.log.*.gz | visitors - <P>
<H2><A NAME="sect5" HREF="#toc5">Stream Mode Details</A></H2>
<P>
The usual way to run
<I>Visitors</I> is to specify some options to control the report generation, and
the names of the log files. For example, to generate a report from two Apache
access log files you can write: <P>
% visitors -A access.log.1 access.log.2 &gt; report.html
<P>
<I>Visitors</I> will analyze the log files, and will output the report. Sometimes
it can be more interesting to have web statistics updated continuously,
almost in real time, as new data is available. In order to provide this
feature <I>Visitors</I> implements a mode called Stream Mode that reads a stream
of logs from the standard input. The following command line shows how to
use it (but check the --stream option documentation for more information).
<P>
% tail -f /var/log/apache/access.log | visitors --stream -A --update-every 60
\<BR>
--output-file /tmp/report.html<BR>
<P>
<I>Visitors</I> will incrementally update the statistics as new logs become available
and will update the HTML report every 60 seconds. As you can see, in this
mode you are required to specify the report file name using the <B>--output-file</B> option
because <I>Visitors</I> needs to overwrite the report to update it. Note that instead
of the tail command in the above example it is possible to use
<I>Visitors</I> in Tail Mode (an emulation of the tail program): <P>
% visitors
--tail /var/log/apache/access.log | visitors --stream -A --update-every 60 \<BR>
--output-file /tmp/report.html<BR>
<P>
It's possible to generate real time statistics about the last N seconds
of web traffic, where N is configurable and can range from a few seconds to
one week or more, using the <B>--reset-every</B> option. The following example generates
statistics, updated every 30 seconds, about the last hour of traffic: <P>
%
visitors --tail /var/log/apache/access.log | visitors --stream -A --update-every
30 --reset-every 3600 \<BR>
--output-file /tmp/report.html<BR>
<H2><A NAME="sect6" HREF="#toc6">Authors</A></H2>
<P>
<I>Visitors</I> was written by Salvatore Sanfilippo &lt;antirez@invece.org&gt;.
<H2><A NAME="sect7" HREF="#toc7">Copying</A></H2>
Copyright (C) 2004,2005 Salvatore Sanfilippo &lt;antirez@invece.org&gt;. <P>
<I>Visitors</I>
is distributed under the GNU General Public License. <P>
This manual page was
written (based on the original HTML documentation) by Romain Francoise
&lt;rfrancoise@debian.org&gt; for the Debian GNU/Linux system, but may be used by
others. Salvatore Sanfilippo has updated this man page since Visitors
0.5, and it is now part of the Visitors tarball. <P>
<HR><P>
<A NAME="toc"><B>Table of Contents</B></A><P>
<UL>
<LI><A NAME="toc0" HREF="#sect0">Name</A></LI>
<LI><A NAME="toc1" HREF="#sect1">Synopsis</A></LI>
<LI><A NAME="toc2" HREF="#sect2">Description</A></LI>
<UL>
<LI><A NAME="toc3" HREF="#sect3">Available options:</A></LI>
</UL>
<LI><A NAME="toc4" HREF="#sect4">Examples</A></LI>
<LI><A NAME="toc5" HREF="#sect5">Stream Mode Details</A></LI>
<LI><A NAME="toc6" HREF="#sect6">Authors</A></LI>
<LI><A NAME="toc7" HREF="#sect7">Copying</A></LI>
</UL>
</td>
</tr>
</table>
<br>
<small>Copyright (C) 2005 Salvatore Sanfilippo -- All Rights Reserved</small>
<br><br>
</center>
</body>
</html>