fixing file references
commit dfbe291290398f689c05b5ed91cc38b688886434 1 parent 93dad37
Philip (flip) Kromer authored
Showing with 24 additions and 16 deletions.
  1. +2 −2 07a-serverlog_parsing.asciidoc
  2. +6 −6 07b-pageview_histograms.asciidoc
  3. +4 −4 07c-sessionizing.asciidoc
  4. +2 −2 07d-page_page_similarity.asciidoc
  5. +0 −2  07e-geo_ip_matching.asciidoc
  6. 0  code/serverlogs/old/{server_logs-00-model-base.rb → logline-00-model-base.rb}
  7. 0  code/serverlogs/old/{server_logs-00-model-day_hr.rb → logline-00-model-day_hr.rb}
  8. 0  code/serverlogs/old/{server_logs-00-model-detail.rb → logline-00-model-detail.rb}
  9. 0  code/serverlogs/old/{server_logs-00-model-parse.rb → logline-00-model-parse.rb}
  10. 0  code/serverlogs/old/{server_logs-00-model-regexp.rb → logline-00-model-regexp.rb}
  11. 0  code/serverlogs/old/{server_logs-02-histograms-01-mapper.log → logline-02-histograms-01-mapper.log}
  12. 0  ...rver_logs-02-histograms-02-mapper-wu-lign-sort.log → logline-02-histograms-02-mapper-wu-lign-sort.log}
  13. 0  code/serverlogs/old/{server_logs-02-histograms-03-reduce.log → logline-02-histograms-03-reduce.log}
  14. 0  code/serverlogs/old/{server_logs-02-histograms-04-freals.log → logline-02-histograms-04-freals.log}
  15. 0  code/serverlogs/old/{server_logs-03-breadcrumbs-02-mapper.log → logline-03-breadcrumbs-02-mapper.log}
  16. 0  code/serverlogs/old/{server_logs-03-breadcrumbs-03-reducer.log → logline-03-breadcrumbs-03-reducer.log}
  17. 0  ...erlogs/old/{server_logs-04-page_page_edges-03-reducer.log → logline-04-page_page_edges-03-reducer.log}
  18. +10 −0 code/serverlogs/parser--processor.rb
4 07a-serverlog_parsing.asciidoc
@@ -5,7 +5,7 @@ We'll represent loglines with the following <<serverlog_logline_fields,model def
[[serverlog_logline_fields]]
----
-include::code/serverlogs/model/logline--fields.rb[]
+include::code/serverlogs/models/logline--fields.rb[]
----
Since most of our questions are about what visitors do, we'll mainly use `visitor_id` (to identify common requests for a visitor), `uri_str` (what they requested), `requested_at` (when they requested it) and `referer` (the page they clicked on). Don't worry if you're not deeply familiar with the rest of the fields in our model -- they'll become clear in context.
@@ -43,7 +43,7 @@ ____
The details of parsing are mostly straightforward -- we use a regular expression to pick apart the fields in each line. That regular expression, however, is another story:
----
-include::code/serverlogs/logline-00-model-regexp.rb[]
+include::code/serverlogs/models/logline--parse.rb[]
----
It may look terrifying, but taken piece-by-piece it's not actually that bad. Regexp-fu is an essential skill for data science in practice -- you're well advised to walk through it. Let's do so.
12 07b-pageview_histograms.asciidoc
@@ -3,13 +3,13 @@
Let's start exploring the dataset. Andy Baio
----
-include::code/serverlogs/logline-02-histograms-mapper.rb[]
+include::code/serverlogs/old/logline-02-histograms-mapper.rb[]
----
We want to group on `date_hr`, so just add a 'virtual accessor' -- a method that behaves like an attribute but derives its value from another field:
----
-include::code/serverlogs/logline-00-model-date_hr.rb[]
+include::code/serverlogs/old/logline-00-model-date_hr.rb[]
----
This is the advantage of having a model and not just a passive sack of data.
@@ -17,7 +17,7 @@ This is the advantage of having a model and not just a passive sack of data.
Run it in map mode:
----
-include::code/serverlogs/logline-02-histograms-02-mapper-wu-lign-sort.log[]
+include::code/serverlogs/old/logline-02-histograms-02-mapper-wu-lign-sort.log[]
----
TODO: digression about `wu-lign`.
@@ -25,17 +25,17 @@ TODO: digression about `wu-lign`.
Sort and save the map output; then write and debug your reducer.
----
-include::code/serverlogs/logline-02-histograms-full.rb[]
+include::code/serverlogs/old/logline-02-histograms-full.rb[]
----
When things are working, this is what you'll see. Notice that the `.../Star_Wars_Kid.wmv` file already have five times the pageviews as the site root (`/`).
----
-include::code/serverlogs/logline-02-histograms-03-reduce.log[]
+include::code/serverlogs/old/logline-02-histograms-03-reduce.log[]
----
You're ready to run the script in the cloud! Fire it off and you'll see dozens of workers start processing the data.
----
-include::code/serverlogs/logline-02-histograms-04-freals.log[]
+include::code/serverlogs/old/logline-02-histograms-04-freals.log[]
----
8 07c-sessionizing.asciidoc
@@ -9,7 +9,7 @@ NOTE:[Take a moment and think about the locality: what feature(s) do we need to
spit out `[ip, date_hr, visit_time, path]`.
----
-include::code/serverlogs/logline-03-breadcrumbs-full.rb[]
+include::code/serverlogs/old/logline-03-breadcrumbs-full.rb[]
----
You might ask why we don't partition directly on say both `visitor_id` and date (or other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of the requests would be in one day, the rest would be in the next day.
@@ -17,17 +17,17 @@ You might ask why we don't partition directly on say both `visitor_id` and date
run it in map mode:
----
-include::code/serverlogs/logline-02-histograms-01-mapper.log[]
+include::code/serverlogs/old/logline-02-histograms-01-mapper.log[]
----
----
-include::code/serverlogs/logline-03-breadcrumbs-02-mapper.log[]
+include::code/serverlogs/old/logline-03-breadcrumbs-02-mapper.log[]
----
group on user
----
-include::code/serverlogs/logline-03-breadcrumbs-03-reducer.log[]
+include::code/serverlogs/old/logline-03-breadcrumbs-03-reducer.log[]
----
We use the secondary sort so that each visit is in strict order of time within a session.
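The locality argument in this hunk (partition on the visitor alone, then rely on time order within each visitor) can be sketched in plain Ruby. This is only an illustration under stated assumptions: the 30-minute cutoff, the `sessionize` name, and the `[visitor_id, epoch_seconds, path]` row shape are hypothetical and do not come from the book's code.

```ruby
# Assumed session cutoff; the chapter text does not state a value.
SESSION_GAP_SECS = 30 * 60

# visits: array of [visitor_id, visited_at_epoch_seconds, path] rows.
# Returns an array of [visitor_id, [paths...]] pairs, one per session.
def sessionize(visits, gap = SESSION_GAP_SECS)
  # Group on visitor only, so a session that crosses midnight stays together.
  visits.group_by(&:first).flat_map do |visitor_id, rows|
    # Stand-in for the secondary sort: order each visitor's rows by time.
    ordered = rows.sort_by { |row| row[1] }
    # Start a new session whenever two consecutive requests are too far apart.
    ordered
      .slice_when { |prev, curr| curr[1] - prev[1] > gap }
      .map { |chunk| [visitor_id, chunk.map(&:last)] }
  end
end
```

With one-day buckets of 86,400 seconds, a request at 85,800 and another at 86,700 fall on different days but end up in the same session, which is the case that partitioning by date would break.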
4 07d-page_page_similarity.asciidoc
@@ -4,11 +4,11 @@
What can you do with the sessionized logs? Well, each row lists a visitor-session on the left and a bunch of pages on the right. We've been thinking about that as a table, but it's also a graph -- actually, a bunch of graphs! The <<sidebar,serverlogs_affinity_graph>> describes an _affinity graph_, but we can build a simpler graph that just connects pages to pages by counting the number of times a pair of pages were visited by the same session. Every time a person requests the `/archive/2003/04/03/typo_pop.shtml` page _and_ the `/archive/2003/04/29/star_war.shtml` page in the same visit, that's one point towards their similarity. The chapter on <<graph_processing>> has lots of fun things to do with a graph like this, so for now we'll just lay the groundwork by computing the page-page similarity graph defined by visitor sessions.
----
-include::code/serverlogs/logline-04-page_page_edges-full.rb[]
+include::code/serverlogs/old/logline-04-page_page_edges-full.rb[]
----
----
-include::code/serverlogs/logline-04-page_page_edges-03-reducer.log[]
+include::code/serverlogs/old/logline-04-page_page_edges-03-reducer.log[]
----
[[serverlogs_affinity_graph]]
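The pair counting described in this hunk (one point for each session that contains both pages) might look like the following sketch. It is not the contents of `logline-04-page_page_edges-full.rb`, which this commit does not show. The `page_pair_counts` name is hypothetical, and counting each pair at most once per session is an assumption.

```ruby
# sessions: array of page-path arrays, one array per visitor session.
# Returns a hash mapping [page_a, page_b] (sorted) to the number of
# sessions in which both pages were requested.
def page_pair_counts(sessions)
  counts = Hash.new(0)
  sessions.each do |pages|
    # uniq: a repeated page counts once per session (assumption).
    # sort: gives each unordered pair a single canonical key.
    pages.uniq.sort.combination(2) { |a, b| counts[[a, b]] += 1 }
  end
  counts
end
```

Each key is an edge of the page-page graph and each value is its weight.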
2  07e-geo_ip_matching.asciidoc
@@ -1,6 +1,4 @@
-
-
=== Geo-IP Matching ===
[[range_query]]
0  code/serverlogs/old/server_logs-00-model-base.rb → code/serverlogs/old/logline-00-model-base.rb
File renamed without changes
0  code/serverlogs/old/server_logs-00-model-day_hr.rb → code/serverlogs/old/logline-00-model-day_hr.rb
File renamed without changes
0  code/serverlogs/old/server_logs-00-model-detail.rb → code/serverlogs/old/logline-00-model-detail.rb
File renamed without changes
0  code/serverlogs/old/server_logs-00-model-parse.rb → code/serverlogs/old/logline-00-model-parse.rb
File renamed without changes
0  code/serverlogs/old/server_logs-00-model-regexp.rb → code/serverlogs/old/logline-00-model-regexp.rb
File renamed without changes
0  ...rlogs/old/server_logs-02-histograms-01-mapper.log → ...erverlogs/old/logline-02-histograms-01-mapper.log
File renamed without changes
0  ...ver_logs-02-histograms-02-mapper-wu-lign-sort.log → .../logline-02-histograms-02-mapper-wu-lign-sort.log
File renamed without changes
0  ...rlogs/old/server_logs-02-histograms-03-reduce.log → ...erverlogs/old/logline-02-histograms-03-reduce.log
File renamed without changes
0  ...rlogs/old/server_logs-02-histograms-04-freals.log → ...erverlogs/old/logline-02-histograms-04-freals.log
File renamed without changes
0  ...logs/old/server_logs-03-breadcrumbs-02-mapper.log → ...rverlogs/old/logline-03-breadcrumbs-02-mapper.log
File renamed without changes
0  ...ogs/old/server_logs-03-breadcrumbs-03-reducer.log → ...verlogs/old/logline-03-breadcrumbs-03-reducer.log
File renamed without changes
0  ...old/server_logs-04-page_page_edges-03-reducer.log → ...ogs/old/logline-04-page_page_edges-03-reducer.log
File renamed without changes
10 code/serverlogs/parser--processor.rb
@@ -0,0 +1,10 @@
+class ApacheLogParser < Wukong::Streamer::Base
+  include Wukong::Streamer::EncodingCleaner
+
+  def process(rawline)
+    logline = Logline.parse(rawline)
+    yield [logline.to_tsv]
+  end
+end
+
+Wukong.run( ApacheLogParser )
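The new streamer depends on `Logline.parse` and `to_tsv`, neither of which appears in this commit. As a rough illustration of the same parse-then-emit step without Wukong, here is a plain-Ruby sketch. `SketchLogline`, `SKETCH_LOG_RE`, and `process_lines` are hypothetical names, and the regexp assumes the Apache combined log format. The book's actual model and regular expression may differ.

```ruby
# Assumed Apache "combined" log format; not the regexp from the book.
SKETCH_LOG_RE = %r{\A
  (\S+)\s+\S+\s+\S+\s+            # client ip, then two ignored identity fields
  \[([^\]]+)\]\s+                 # timestamp inside square brackets
  "(\S+)\s+(\S+)\s+([^"]+)"\s+    # request line: method, path, protocol
  (\d+)\s+(\d+|-)\s+              # response code and byte size
  "([^"]*)"\s+"([^"]*)"           # referer and user agent
\s*\z}x

# Hypothetical stand-in for the Logline model. uri_str, requested_at and
# referer follow the chapter text; the other field names are assumptions.
SketchLogline = Struct.new(:ip, :requested_at, :http_method, :uri_str,
                           :protocol, :response_code, :bytesize,
                           :referer, :user_agent) do
  # Returns a SketchLogline, or nil when the line does not match.
  def self.parse(rawline)
    match = SKETCH_LOG_RE.match(rawline.chomp)
    match && new(*match.captures)
  end

  def to_tsv
    to_a.join("\t")
  end
end

# Mirrors the shape of ApacheLogParser#process: one raw line in, one TSV row out.
def process_lines(lines)
  lines.each do |rawline|
    logline = SketchLogline.parse(rawline)
    next unless logline # skip lines the regexp cannot parse
    yield logline.to_tsv
  end
end
```

For example, `process_lines($stdin.each_line) { |tsv| puts tsv }` would turn a log piped on standard input into tab-separated rows.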