Event Streams

Webserver Log Parsing

We’ll represent loglines with the following model definition:

link:code/serverlogs/models/logline--fields.rb[role=include]

Since most of our questions are about what visitors do, we’ll mainly use visitor_id (to identify common requests for a visitor), uri_str (what they requested), requested_at (when they requested it) and referer (the page they clicked on). Don’t worry if you’re not deeply familiar with the rest of the fields in our model — they’ll become clear in context.

Two notes, though. In these explorations, we’re going to use the ip_address field for the visitor_id. That’s good enough if you don’t mind artifacts, like every visitor from the same coffee shop being treated identically. For serious use, though, many web applications assign an identifying "cookie" to each visitor and add it as a custom logline field. Following good practice, we’ve built this model with a visitor_id method that decouples the semantics ("visitor") from the data ("the IP address they happen to have visited from"). Also please note that, though the dictionary blesses the term 'referrer', the early authors of the web used the spelling 'referer' and we’re now stuck with it in this context.
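To fix the shape of things in your head, here is a rough sketch of the model, not the real definition in the linked file: just the handful of fields discussed above plus the visitor_id method, written as a Gorillib-style model (our assumption about how the Wukong model is declared).

    # A hypothetical, abbreviated Logline model -- see the linked file for the real one.
    class Logline
      include Gorillib::Model          # assumption: Wukong models use the Gorillib field DSL
      field :ip_address,   String
      field :requested_at, Time
      field :http_method,  String
      field :uri_str,      String
      field :referer,      String
      # ... the remaining fields are elided here ...

      # decouple the semantics ("visitor") from the data (the IP address they arrived from)
      def visitor_id
        ip_address
      end
    end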

Simple Log Parsing

The core nugget of simple log parsing is that each logline is a complete, self-contained record of a single event, so turning a raw log into structured records takes nothing fancier than a line-by-line transform.

Webserver logs typically follow some variant of the "Apache Common Log" format — a series of lines describing each web request:

154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"

Our first task is to leave that arcane format behind and extract healthy structured models. Since every line stands alone, the parse script is as simple as can be: a transform-only script that passes each line to the Logline.parse method and emits the model object it returns.

link:code/serverlogs/parser--processor.rb[role=include]
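The linked file holds the actual processor; in rough sketch (and assuming that Logline.parse returns nil for lines it cannot interpret, so that junk lines are silently skipped), it amounts to little more than this:

    processor :parse_loglines do
      def process(line)
        logline = Logline.parse(line)   # the model's parse method does the real work
        yield logline if logline        # emit the structured record; skip unparseable lines
      end
    end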
Star Wars Kid serverlogs

For sample data, we’ll use the webserver logs released by blogger Andy Baio. In 2003, he posted the famous "Star Wars Kid" video, which for several years ranked as the biggest viral video of all time. (It augments a teenager’s awkward Jedi-style fighting moves with the special effects of a real lightsaber.) Here’s his description:

I’ve decided to release the first six months of server logs from the meme’s spread into the public domain — with dates, times, IP addresses, user agents, and referer information. … On April 29 at 4:49pm, I posted the video, renamed to "Star_Wars_Kid.wmv" — inadvertently giving the meme its permanent name. (Yes, I coined the term "Star Wars Kid." It’s strange to think it would’ve been "Star Wars Guy" if I was any lazier.) From there, for the first week, it spread quickly through news sites, blogs, and message boards, mostly oriented around technology, gaming, and movies. …

This file is a subset of the Apache server logs from April 10 to November 26, 2003. It contains every request for my homepage, the original video, the remix video, the mirror redirector script, the donations spreadsheet, and the seven blog entries I made related to Star Wars Kid. I included a couple weeks of activity before I posted the videos so you can determine the baseline traffic I normally received to my homepage. The data is public domain. If you use it for anything, please drop me a note!

The details of parsing are mostly straightforward — we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:

link:code/serverlogs/models/logline--parse.rb[role=include]

It may look terrifying, but taken piece-by-piece it’s not actually that bad. Regexp-fu is an essential skill for data science in practice — you’re well advised to walk through it. Let’s do so.

  • The meat of each line describes the contents to match — \S+ for "a sequence of non-whitespace", \d+ for "a sequence of digits", and so forth. If you’re not already familiar with regular expressions at that level, consult the excellent tutorial at regular-expressions.info. (A simplified example regexp follows this list.)

  • This is an 'extended-form' regular expression, as requested by the x at the end of it. An extended-form regexp ignores whitespace and treats # as a comment delimiter — constructing a regexp this complicated would be madness otherwise. Be careful to backslash-escape spaces and hash marks.

  • The \A and \z anchor the regexp to the absolute start and end of the string respectively.

  • Fields are selected using 'named capture group' syntax: (?<ip_address>\S+). You can retrieve its contents using match[:ip_address], or get all of them at once using captures_hash as we do in the parse method.

  • Build your regular expressions to be 'good brittle'. If you only expect HTTP request methods to be uppercase strings, make your program reject records that are otherwise. When you’re processing billions of events, those one-in-a-million deviants start occurring thousands of times.
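To make those points concrete, here is a simplified regexp of our own devising (not the book’s full pattern, and handling only the leading fields of the sample line above) that shows the extended form, the \A anchor, and named capture groups in action:

    SIMPLE_LOG_RE = %r{
      \A
      (?<ip_address>\S+)          \s   # client IP address
      \S+ \s \S+                  \s   # identd and user fields, rarely useful
      \[(?<requested_at>[^\]]+)\] \s   # timestamp, in brackets
      "(?<http_method>[A-Z]+)     \s   # request method -- 'good brittle': uppercase only
       (?<uri_str>\S+)            \s   # the path requested
       HTTP/[\d\.]+"              \s
      (?<response_code>\d+)       \s
      (?<bytes_sent>\d+|-)             # trailing referer and user agent are not matched here
    }x

    line  = '154.5.248.92 - - [30/Apr/2003:13:17:04 -0700] "GET /random/video/Star_Wars_Kid.wmv HTTP/1.0" 206 176250 "-" "Mozilla/4.77 (Macintosh; U; PPC)"'
    match = SIMPLE_LOG_RE.match(line)
    match[:ip_address]   # => "154.5.248.92"
    match[:uri_str]      # => "/random/video/Star_Wars_Kid.wmv"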

That regular expression does almost all the heavy lifting, but isn’t sufficient to properly extract the requested_at time. Wukong models provide a "security gate" for each field in the form of the receive_(field name) method. The setter method (requested_at=) applies a new value directly, but the receive_requested_at method is expected to appropriately validate and transform the given value. The default method performs simple 'do the right thing'-level type conversion, sufficient to (for example) faithfully load an object from a plain JSON hash. But for complicated cases you’re invited to override it as we do here.

link:code/serverlogs/models/logline--requested_at.rb[role=include]
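The linked file contains the actual override. As a minimal sketch of the idea, assuming the timestamp format shown in the sample logline and the plain requested_at= setter described above, it could read:

    require 'time'   # provides Time.strptime

    # Convert the Apache-style timestamp ("30/Apr/2003:13:17:04 -0700") into a
    # Time object before applying the plain setter; discard values that don't parse.
    def receive_requested_at(val)
      val = Time.strptime(val, '%d/%b/%Y:%H:%M:%S %z') if val.is_a?(String)
      self.requested_at = val
    rescue ArgumentError
      self.requested_at = nil
    end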

There’s a general lesson here for data-parsing scripts. Don’t try to be a hero and get everything done in one giant method. The giant regexp just coarsely separates the values; any further special handling happens in isolated methods.

Here’s a sample of the parser’s output:

link:code/serverlogs/output-parser-map.log[role=include]

Then run it on the full dataset to produce the starting point for the rest of our work:

TODO

Geo-IP Matching

You can learn a lot about your site’s audience in aggregate by mapping IP addresses to geolocation: not just on its own, but especially when joined against other datasets, like census data, store locations, weather and time. [1]

Maxmind makes their GeoLite IP-to-geo database available under an open license (CC-BY-SA)[2]. Out of the box, its columns are beg_ip, end_ip, location_id, where the first two columns show the low and high ends (inclusive) of a range that maps to that location. Every address lies in at most one range; locations may have multiple ranges.

This arrangement caters to range queries in a relational database, but it isn’t suitable for our needs: a single IP-geo block can span thousands of addresses, so there is no fixed-size key we can use to group loglines with the one block that contains them.

To get the right locality, take each range and break it at some block level. Instead of having 1.2.3.4 to 1.2.5.6 on one line, let’s use the first three quads (the first 24 bits) and emit rows for 1.2.3.4 to 1.2.3.255, 1.2.4.0 to 1.2.4.255, and 1.2.5.0 to 1.2.5.6. This lets us use the first segment as the partition key, and the full IP address as the sort key.
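Here is a sketch of that denormalization, with helper names of our own choosing (the real script that produced the files below isn’t shown in this chapter): it walks every /24 block the range touches, clips the block to the original range, and emits the 24-bit partition key alongside the full addresses.

    # Convert dotted-quad strings to integers and back (illustrative helpers).
    def ip_to_i(dotted)
      dotted.split('.').inject(0){|num, quad| (num << 8) + quad.to_i }
    end

    def i_to_ip(num)
      [24, 16, 8, 0].map{|shift| (num >> shift) & 0xFF }.join('.')
    end

    # Split an inclusive [beg_ip, end_ip] range at 24-bit boundaries.
    def split_at_24_bits(beg_ip, end_ip, location_id)
      beg_i, end_i = ip_to_i(beg_ip), ip_to_i(end_ip)
      (beg_i >> 8).upto(end_i >> 8) do |block|       # every /24 block the range touches
        lo = [ block << 8,         beg_i ].max       # clip the block to the original range
        hi = [(block << 8) | 0xFF, end_i ].min
        partition_key = i_to_ip(block << 8)[/\A\d+\.\d+\.\d+/]   # e.g. "1.2.3"
        yield [partition_key, i_to_ip(lo), i_to_ip(hi), location_id]
      end
    end

    split_at_24_bits('1.2.3.4', '1.2.5.6', 123){|row| p row }
    # ["1.2.3", "1.2.3.4", "1.2.3.255", 123]
    # ["1.2.4", "1.2.4.0", "1.2.4.255", 123]
    # ["1.2.5", "1.2.5.0", "1.2.5.6",   123]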

    lines            bytes  description                  file
15_288_766   1_094_541_688  24-bit partition key         maxmind-geolite_city-20121002.tsv
 2_288_690     183_223_435  16-bit partition key         maxmind-geolite_city-20121002-16.tsv
 2_256_627      75_729_432  original (not denormalized)  GeoLiteCity-Blocks.csv

Range Queries

In a nutshell, a range query asks "which range contains this point?" (or "which records fall within this range?"). The IP-geo lookup above is exactly that: each visitor’s address is a point, and each Maxmind block is a range that might contain it.

This is a generally applicable approach for doing range queries:

  • Choose a regular interval, fine enough to avoid skew but coarse enough to avoid ballooning the dataset size.

  • Wherever a range crosses an interval boundary, split it into multiple records, each filling or lying within a single interval.

  • Emit a compound key of [interval, join_handle, beg, end], where

    • interval is the regular interval the (split) segment falls within; it is what the join will group on, and

    • join_handle identifies the originating table, so that once records are grouped together for the join you can still tell the two sides apart. If the interval is transparently a prefix of the index (as it is here), you can instead just ship the remainder: [interval, join_handle, beg_suffix, end_suffix].

  • Use the interval as the partition key and the full compound key as the sort key; each point then has to be checked only against the ranges that share its interval. A sketch follows the list.
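A minimal sketch of that recipe over plain integers (names and details are ours, for illustration):

    # Split an inclusive [beg, fin] range on a regular interval and emit compound keys.
    def emit_range_rows(beg, fin, interval_size, join_handle)
      (beg / interval_size).upto(fin / interval_size) do |interval|
        lo = [ interval * interval_size,                      beg ].max
        hi = [ interval * interval_size + interval_size - 1,  fin ].min
        yield [interval, join_handle, lo, hi]
      end
    end

    # The point-like side of the join emits its key the same way; grouping on
    # interval brings each point together with every range that could contain
    # it, and join_handle tells the two kinds of record apart.
    def emit_point_row(point, interval_size, join_handle)
      yield [point / interval_size, join_handle, point, point]
    end

    emit_range_rows(998, 1122, 100, 'geo'){|row| p row }
    # [9,  "geo",  998,  999]
    # [10, "geo", 1000, 1099]
    # [11, "geo", 1100, 1122]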

In the geodata section, the 'quadtile' scheme is (if you bend your brain right) something of an extension of this idea — instead of splitting ranges on regular intervals, we’ll split regions on a regular grid scheme.

Using Hadoop for website stress testing ("Benign DDoS")

Hadoop is engineered to consume the full capacity of every available resource up to the currently-limiting one. So in general, you should never issue requests against external services from a Hadoop job — one-by-one queries against a database; crawling web pages; requests to an external API. The resulting load spike will effectively be attempting what web security folks call a "DDoS", or distributed denial of service attack.

Unless of course you are trying to test a service for resilience against an adversarial DDoS — in which case that assault is a feature, not a bug!

elephant_stampede
    require 'faraday'

    processor :elephant_stampede do

      def process(logline)
        beg_at = Time.now.to_f
        resp = Faraday.get url_to_fetch(logline)
        yield summarize(resp, beg_at)
      end

      def summarize(resp, beg_at)
        duration = Time.now.to_f - beg_at
        bytesize = resp.body.bytesize
        { duration: duration, bytesize: bytesize }
      end

      def url_to_fetch(logline)
        logline.url
      end
    end

    flow(:mapper){ input > parse_loglines > elephant_stampede }

You must use Wukong’s eventmachine bindings to make more than one simultaneous request per mapper.


1. These databases only impute a coarse-grained estimate of each visitor’s location — they hold no direct information about the person. Please consult your priest/rabbi/spirit guide/grandmom or other appropriate moral compass before diving too deep into the world of unmasking your site’s guests.
2. For serious use, there are professional-grade datasets from Maxmind, Quova, and Digital Element, among others.