
fixing file references

1 parent 93dad37 commit dfbe291290398f689c05b5ed91cc38b688886434 Philip (flip) Kromer committed Mar 6, 2013
@@ -5,7 +5,7 @@ We'll represent loglines with the following <<serverlog_logline_fields,model def
[[serverlog_logline_fields]]
----
-include::code/serverlogs/model/logline--fields.rb[]
+include::code/serverlogs/models/logline--fields.rb[]
----
Since most of our questions are about what visitors do, we'll mainly use `visitor_id` (to identify common requests for a visitor), `uri_str` (what they requested), `requested_at` (when they requested it) and `referer` (the page they clicked on). Don't worry if you're not deeply familiar with the rest of the fields in our model -- they'll become clear in context.
@@ -43,7 +43,7 @@ ____
The details of parsing are mostly straightforward -- we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:
----
-include::code/serverlogs/logline-00-model-regexp.rb[]
+include::code/serverlogs/models/logline--parse.rb[]
----
It may look terrifying, but taken piece-by-piece it's not actually that bad. Regexp-fu is an essential skill for data science in practice -- you're well advised to walk through it. Let's do so.
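The piece-by-piece walkthrough is elided in this diff, but the idea is easy to sketch. Below is a minimal, commented regular expression for an Apache combined-format logline, written with Ruby's extended (`/x`) mode and named captures. This is an illustrative stand-in, not the book's actual regexp (which lives in `code/serverlogs/models/logline--parse.rb`); the field names follow the model described above.

```ruby
# Hypothetical sketch of a combined-log-format parser regexp.
# /x mode ignores literal whitespace, so we can comment each piece.
LOGLINE_RE = %r{
  \A
  (?<ip_address>\S+)              \s+  # client IP
  \S+ \s+ \S+                     \s+  # identd and user (usually '-')
  \[(?<requested_at>[^\]]+)\]     \s+  # request timestamp
  "(?<http_method>\S+) \s (?<uri_str>\S+) \s (?<protocol>[^"]*)" \s+
  (?<response_code>\d+)           \s+  # HTTP status
  (?<bytes_sent>\S+)              \s+  # bytes sent, or '-'
  "(?<referer>[^"]*)"             \s+  # referring page
  "(?<user_agent>[^"]*)"               # browser string
}x

line = '66.249.65.107 - - [06/Mar/2013:04:20:00 -0800] ' \
       '"GET /archive/2003/04/29/star_war.shtml HTTP/1.1" 200 1234 ' \
       '"http://waxy.org/" "Mozilla/5.0"'
m = LOGLINE_RE.match(line)
puts m[:uri_str]       # the requested path
puts m[:referer]       # the page they clicked on
```

Each named capture becomes a field in the model; the real regexp additionally has to tolerate malformed lines, which is where most of the terror comes from.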
@@ -3,39 +3,39 @@
Let's start exploring the dataset. Andy Baio
----
-include::code/serverlogs/logline-02-histograms-mapper.rb[]
+include::code/serverlogs/old/logline-02-histograms-mapper.rb[]
----
We want to group on `date_hr`, so just add a 'virtual accessor' -- a method that behaves like an attribute but derives its value from another field:
----
-include::code/serverlogs/logline-00-model-date_hr.rb[]
+include::code/serverlogs/old/logline-00-model-date_hr.rb[]
----
This is the advantage of having a model and not just a passive sack of data.
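To make the idea concrete, here is a minimal sketch of such a virtual accessor, assuming `requested_at` holds a parseable timestamp and that we bucket into hours as `YYYYMMDD:HH` (the exact format in `logline-00-model-date_hr.rb` may differ):

```ruby
require 'time'

# Illustrative model: date_hr behaves like an attribute, but derives
# its value from the requested_at field rather than being stored.
class Logline
  attr_accessor :requested_at
  def initialize(requested_at)
    @requested_at = requested_at
  end

  # Virtual accessor: bucket the request into an hour, e.g. "20130306:04".
  def date_hr
    Time.parse(requested_at).strftime('%Y%m%d:%H')
  end
end

puts Logline.new('2013-03-06 04:20:00').date_hr   # => "20130306:04"
```

Callers group on `date_hr` exactly as if it were a stored field; a passive sack of data would force every script to re-derive it.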
Run it in map mode:
----
-include::code/serverlogs/logline-02-histograms-02-mapper-wu-lign-sort.log[]
+include::code/serverlogs/old/logline-02-histograms-02-mapper-wu-lign-sort.log[]
----
TODO: digression about `wu-lign`.
Sort and save the map output; then write and debug your reducer.
----
-include::code/serverlogs/logline-02-histograms-full.rb[]
+include::code/serverlogs/old/logline-02-histograms-full.rb[]
----
When things are working, this is what you'll see. Notice that the `.../Star_Wars_Kid.wmv` file already has five times the pageviews of the site root (`/`).
----
-include::code/serverlogs/logline-02-histograms-03-reduce.log[]
+include::code/serverlogs/old/logline-02-histograms-03-reduce.log[]
----
You're ready to run the script in the cloud! Fire it off and you'll see dozens of workers start processing the data.
----
-include::code/serverlogs/logline-02-histograms-04-freals.log[]
+include::code/serverlogs/old/logline-02-histograms-04-freals.log[]
----
@@ -9,25 +9,25 @@ NOTE:[Take a moment and think about the locality: what feature(s) do we need to
spit out `[ip, date_hr, visit_time, path]`.
----
-include::code/serverlogs/logline-03-breadcrumbs-full.rb[]
+include::code/serverlogs/old/logline-03-breadcrumbs-full.rb[]
----
You might ask why we don't partition directly on say both `visitor_id` and date (or other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of the requests would be in one day, the rest would be in the next day.
Run it in map mode:
----
-include::code/serverlogs/logline-02-histograms-01-mapper.log[]
+include::code/serverlogs/old/logline-02-histograms-01-mapper.log[]
----
----
-include::code/serverlogs/logline-03-breadcrumbs-02-mapper.log[]
+include::code/serverlogs/old/logline-03-breadcrumbs-02-mapper.log[]
----
Group on the visitor:
----
-include::code/serverlogs/logline-03-breadcrumbs-03-reducer.log[]
+include::code/serverlogs/old/logline-03-breadcrumbs-03-reducer.log[]
----
We use the secondary sort so that each visit is in strict order of time within a session.
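What the partition-plus-secondary-sort buys us can be shown with a plain Ruby sketch (this only illustrates the ordering the framework delivers; it is not Wukong's actual partitioner):

```ruby
# Map output records: [visitor, time, path]. Hadoop partitions on the
# visitor, then secondary-sorts on time, so each reducer sees one
# visitor's requests in strict time order.
records = [
  ['1.2.3.4', '2013-03-06T04:22', '/star_war.shtml'],
  ['5.6.7.8', '2013-03-06T04:21', '/'],
  ['1.2.3.4', '2013-03-06T04:20', '/'],
]

# Composite sort key: visitor is the partition (grouping) key,
# time is the secondary sort key within each visitor.
sorted = records.sort_by { |visitor, time, _path| [visitor, time] }

sorted.each { |rec| puts rec.join("\t") }
```

Because the time sort happens within each visitor's partition, a session that crosses midnight stays contiguous, which is exactly the locality the date-partitioned alternative would break.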
@@ -4,11 +4,11 @@
What can you do with the sessionized logs? Well, each row lists a visitor-session on the left and a bunch of pages on the right. We've been thinking about that as a table, but it's also a graph -- actually, a bunch of graphs! The <<sidebar,serverlogs_affinity_graph>> describes an _affinity graph_, but we can build a simpler graph that just connects pages to pages by counting the number of times a pair of pages were visited by the same session. Every time a person requests the `/archive/2003/04/03/typo_pop.shtml` page _and_ the `/archive/2003/04/29/star_war.shtml` page in the same visit, that's one point towards their similarity. The chapter on <<graph_processing>> has lots of fun things to do with a graph like this, so for now we'll just lay the groundwork by computing the page-page similarity graph defined by visitor sessions.
----
-include::code/serverlogs/logline-04-page_page_edges-full.rb[]
+include::code/serverlogs/old/logline-04-page_page_edges-full.rb[]
----
----
-include::code/serverlogs/logline-04-page_page_edges-03-reducer.log[]
+include::code/serverlogs/old/logline-04-page_page_edges-03-reducer.log[]
----
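The core of the page-page edge computation can be sketched in a few lines of plain Ruby (an illustrative reduction over in-memory sessions, not the book's actual script; the paths are examples from the text):

```ruby
# Each session is the list of pages one visitor requested. Every
# unordered pair of pages co-visited in a session scores one point
# toward that pair's edge weight in the similarity graph.
sessions = [
  %w[/archive/2003/04/03/typo_pop.shtml /archive/2003/04/29/star_war.shtml],
  %w[/archive/2003/04/29/star_war.shtml /archive/2003/04/03/typo_pop.shtml /],
]

edges = Hash.new(0)
sessions.each do |pages|
  pages.uniq.combination(2) do |a, b|
    edges[[a, b].sort] += 1   # sort so [a,b] and [b,a] count as one edge
  end
end

edges.each { |(a, b), weight| puts [a, b, weight].join("\t") }
```

In the actual job, the reducer sees one session per group and emits the pairs; a follow-on group-and-sum produces the edge weights.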
[[serverlogs_affinity_graph]]
@@ -1,6 +1,4 @@
-
-
=== Geo-IP Matching ===
[[range_query]]
@@ -0,0 +1,10 @@
+class ApacheLogParser < Wukong::Streamer::Base
+ include Wukong::Streamer::EncodingCleaner
+
+ def process(rawline)
+ logline = Logline.parse(rawline)
+ yield [logline.to_tsv]
+ end
+end
+
+Wukong.run( ApacheLogParser )
