fixing file references

1 parent 93dad37 commit dfbe291290398f689c05b5ed91cc38b688886434 Philip (flip) Kromer committed Mar 6, 2013
@@ -5,7 +5,7 @@ We'll represent loglines with the following <<serverlog_logline_fields,model def
Since most of our questions are about what visitors do, we'll mainly use `visitor_id` (to identify common requests for a visitor), `uri_str` (what they requested), `requested_at` (when they requested it) and `referer` (the page whose link they clicked to get there). Don't worry if you're not deeply familiar with the rest of the fields in our model -- they'll become clear in context.
@@ -43,7 +43,7 @@ ____
The details of parsing are mostly straightforward -- we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:
It may look terrifying, but taken piece-by-piece it's not actually that bad. Regexp-fu is an essential skill for data science in practice -- you're well advised to walk through it. Let's do so.
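As a rough illustration of the piece-by-piece approach, here is a minimal sketch of a commented regular expression for the common Apache "combined" log layout. It is not the regexp from this chapter: the constant `LOG_RE`, the helper `parse_logline`, and the capture names other than the model fields mentioned above are assumptions made for the example.

```ruby
# Hypothetical pattern for the Apache "combined" log format.
# The /x flag lets us spread the pattern out and comment each piece.
LOG_RE = %r{\A
  (?<ip>\S+)\s+                      # client address
  (?<ident>\S+)\s+                   # identd field, usually "-"
  (?<user>\S+)\s+                    # authenticated user, usually "-"
  \[(?<requested_at>[^\]]+)\]\s+     # timestamp inside square brackets
  "(?<http_method>[A-Z]+)\s+(?<uri_str>\S+)\s+(?<protocol>[^"]+)"\s+
  (?<response_code>\d{3})\s+         # HTTP status
  (?<bytesize>\d+|-)\s+              # response size, or "-" when absent
  "(?<referer>[^"]*)"\s+             # referring page
  "(?<user_agent>[^"]*)"             # browser identification
\z}x

# Returns a Hash of field name => raw string, or nil if the line doesn't match.
def parse_logline(line)
  match = LOG_RE.match(line.chomp)
  match && match.named_captures
end
```

Returning `nil` for a non-matching line, rather than raising, is one possible choice: it lets a caller count and skip bad records instead of killing a long-running job.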
@@ -3,39 +3,39 @@
Let's start exploring the dataset. Andy Baio
We want to group on `date_hr`, so just add a 'virtual accessor' -- a method that behaves like an attribute but derives its value from another field:
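A minimal sketch of such a virtual accessor follows. `SketchLogline` is a hypothetical stand-in built on a plain `Struct`, not the book's model class, and the `YYYYMMDDHH` format for `date_hr` is an assumption.

```ruby
# Hypothetical stand-in for the logline model; only three fields are shown.
SketchLogline = Struct.new(:visitor_id, :uri_str, :requested_at) do
  # Virtual accessor: reads like an attribute, but is derived from
  # requested_at (a Time) every time it is called.
  def date_hr
    requested_at.strftime('%Y%m%d%H')
  end
end

line = SketchLogline.new('visitor-1', '/', Time.utc(2003, 4, 29, 20, 34, 17))
line.date_hr  # hour bucket for grouping
```

Because `date_hr` is computed rather than stored, it cannot drift out of sync with `requested_at`.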
This is the advantage of having a model and not just a passive sack of data.
Run it in map mode:
TODO: digression about `wu-lign`.
Sort and save the map output; then write and debug your reducer.
When things are working, this is what you'll see. Notice that the `.../Star_Wars_Kid.wmv` file already has five times the pageviews of the site root (`/`).
You're ready to run the script in the cloud! Fire it off and you'll see dozens of workers start processing the data.
@@ -9,25 +9,25 @@ NOTE:[Take a moment and think about the locality: what feature(s) do we need to
spit out `[ip, date_hr, visit_time, path]`.
You might ask why we don't partition directly on, say, both `visitor_id` and date (or some other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of the requests would be in one day, the rest would be in the next day.
Run it in map mode:
group on user
We use the secondary sort so that each visit is in strict order of time within a session.
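The idea can be sketched in plain Ruby, with an in-memory `group_by` and `sort_by` standing in for the partition and secondary sort that the framework would perform. The `sessionize` helper, the hash keys, and the 30-minute `SESSION_GAP` are all assumptions for illustration; the text does not specify a session cutoff.

```ruby
SESSION_GAP = 30 * 60 # seconds of inactivity that end a session (hypothetical)

# records: array of hashes with :visitor_id, :requested_at (Time) and :path.
# Returns [visitor_id, [paths in time order]] pairs, one per session.
def sessionize(records, gap = SESSION_GAP)
  records.group_by { |r| r[:visitor_id] }.flat_map do |visitor_id, visits|
    # Stand-in for the secondary sort: strict time order within a visitor.
    ordered = visits.sort_by { |r| r[:requested_at] }
    # Start a new session wherever two consecutive requests are too far apart.
    ordered.slice_when { |a, b| b[:requested_at] - a[:requested_at] > gap }
           .map { |session| [visitor_id, session.map { |r| r[:path] }] }
  end
end
```

Since grouping is on the visitor alone, a request at 23:50 and another at 00:05 the next day stay in the same session, which is exactly what a date-based partition would break.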
@@ -4,11 +4,11 @@
What can you do with the sessionized logs? Well, each row lists a visitor-session on the left and a bunch of pages on the right. We've been thinking about that as a table, but it's also a graph -- actually, a bunch of graphs! The <<sidebar,serverlogs_affinity_graph>> describes an _affinity graph_, but we can build a simpler graph that just connects pages to pages by counting the number of sessions in which both pages of a pair were visited. Every time a person requests the `/archive/2003/04/03/typo_pop.shtml` page _and_ the `/archive/2003/04/29/star_war.shtml` page in the same visit, that's one point towards their similarity. The chapter on <<graph_processing>> has lots of fun things to do with a graph like this, so for now we'll just lay the groundwork by computing the page-page similarity graph defined by visitor sessions.
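A minimal in-memory sketch of that pair counting is shown here. The `page_pair_counts` helper is hypothetical, and a real job would emit each pair from the mapper and sum in the reducer rather than hold a hash in memory.

```ruby
# sessions: array of arrays, each listing the pages requested in one session.
# Returns a Hash mapping [page_a, page_b] (sorted) => number of sessions
# in which both pages appeared.
def page_pair_counts(sessions)
  counts = Hash.new(0)
  sessions.each do |pages|
    # uniq so a page requested twice in one session scores only once;
    # sort so each unordered pair always has a single canonical key.
    pages.uniq.sort.combination(2) { |pair| counts[pair] += 1 }
  end
  counts
end
```

Each key is an edge of the page-page graph and each value is its weight. A session with n distinct pages yields n(n-1)/2 pairs, so very long sessions can dominate the output.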
@@ -1,6 +1,4 @@
=== Geo-IP Matching ===
@@ -0,0 +1,10 @@
+class ApacheLogParser < Wukong::Streamer::Base
+ include Wukong::Streamer::EncodingCleaner
+ def process(rawline)
+ logline = Logline.parse(rawline)
+ yield [logline.to_tsv]
+ end
+ ApacheLogParser )
